My Learning Path Toward Becoming a Machine Learning Engineer
A step-by-step guide for helping you work toward building a machine learning engineer skillset
About two years ago, I began this learning path toward becoming a data scientist. Through some great advice from different mentors along the way, I cobbled together the learning path detailed here in this post. Each person had different things to share along the way, often catering well to their personal learning styles. Some jibed with me quite well, while others… not so much. For as much as I write, I’m ironically not a great reader. So when somebody recommended I read a book to learn a certain skill, I sought an alternative resource more suited to my learning style.
Along the way, the machine learning engineer (MLE) role was formalized in the company I work for, and that added some stuff on top of my already data scientist-inclined path. It fortunately ended up working out really well since I was already in more of an engineer-like role prior to entering this new MLE role. But had I known that I’d be taking a more MLE-inclined path at the outset, I might not have taken things in the order that I did.
And so friends, that’s what this post is here for! Keep in mind, before I began this journey, my technical skill level was fairly minimal. The extent of my technical experience was as a helpdesk specialist in undergrad and an internship building Microsoft Access databases and interfaces. That’s… pretty minimal in terms of a machine learning skillset! So really, I started from almost nothing. I mean it. Go look at my LinkedIn profile. Pretty much all my professional experience and formal education degrees are business-oriented. All that to say… if I can do it, you can too. I’m sure many of you checking out this post are already general software developers or data analysts, so in a sense, you’re already ahead of where I started!
Okay, that’s a tiny bit about me and my history! Before we get into the meat of this learning path, let’s take a minute to talk about how this post is structured, including why I broke things down the way that I did and why I recommend the respective learning materials further on down.
Why I Selected These Recommended Courses
Remember when I said I’m not a great reader? Before actually sharing the learning materials I recommend, I want to share at a general level the three principles I look for when selecting learning materials.
- Interactivity: I’m a big believer in getting hands-on experience to learn any skill. Because every domain here is a software-oriented domain, every area offers opportunity to get your hands dirty with things like code and datasets. Some are a little more structured than others, but I’m pretty sure every single course recommended has some sort of interactive activity built in somehow.
- Self-Paced Learning: As a full-time employee and dad to two little girls, flexibility is a must-have in almost anything I do in life. Even before my daughters were born, I did one semester in an MBA program onsite during evenings after work, and I found that to be an absolute nightmare. Additionally, I find that I learn at a much faster pace than a traditional university environment offers. Fortunately, the Internet has really streamlined education in this sense, and I’m happy to share that every recommended course in this learning plan holds to a self-paced learning model.
- Cost Effectiveness: Don’t get me on my soapbox about the cost of traditional higher education. Given that a typical Master’s degree program runs well over $15,000, I can assure you that you don’t need to break the bank to learn these skills! I paid for most of these things out of my own pocket, and they were MUCH cheaper than my Master’s degree. Many of the courses are even free. The most expensive it gets is with those Udacity nanodegrees. Generally speaking, I loved the way they structured their content enough that I found it worth the cost, and they’re still WAY cheaper than traditional higher education.
Keeping those principles in mind, you won’t find any books on this list just because I don’t learn most efficiently from books. Don’t get me wrong, I do love to read, but when it comes to these software-oriented things, I think it’s just a lot easier to watch a video followed by a hands-on interactive lab. And considering how much great content there is on the Internet these days, traditional higher education just isn’t worth the cost. But if you find some other means is more effective for your learning style than what I recommend here, by all means pursue that. After all, that’s kind of what I did with iterating upon the original path handed to me!
How This Learning Path Is Structured
With those principles out of the way, let’s talk about how I generally divided this plan into separate domains and a general order for going about tackling them. Speaking of order… I’m going to be firmer at some points and looser at others when recommending the path to take, since it’s not a particularly linear path. I’ll try to make this as clear as possible as we continue forward.
Alrighty, let’s talk about the four domains I separated things into. We’ll give a general overview of them here and then cover them off more deeply in forthcoming sections.
- Data Engineering: Data, data, data. Data is naturally at the heart of machine learning since machine learning learns patterns from that data, but data often doesn’t come out-of-the-box ready. This set of courses focuses on how to build skills in the data engineering space, including introducing you to programming languages like Python and SQL, how to clean data appropriately for predictive algorithms, and how to structure data in things like relational and non-relational databases.
- Machine Learning: Because machine learning is so dependent on data / feature engineering, this is the one domain that I definitely recommend holding off on until you get through the data engineering content. Once you do get through data engineering, I think you’ll find learning about machine learning as a whole is actually not too bad and sort of fun. Software libraries like Scikit-Learn and TensorFlow have really abstracted away a lot of the “painful” details of common machine learning algorithms, so understanding theory here is going to be more important than anything. Actually building the predictive model with something like Scikit-Learn is a piece of cake. (Given that your data is appropriately engineered, of course!)
- Software Engineering: This domain is more specific to an MLE role than to a pure data scientist role, since the MLE works to enable the proper infrastructure and tools needed to make machine learning happen in a software environment. This domain is definitely important, but it’s not necessarily dependent on the previous two to begin learning. If you’re already a software developer, you may be able to skip over a lot of this content. We cover a lot of ground in this domain, including basic stuff around Git and CI/CD, containerization through Docker and Kubernetes, and cloud engineering on platforms like Amazon Web Services (AWS) and Google Cloud Platform (GCP).
- Math: I know, what a fancy name! So I’m going to be 100% honest… you don’t really need to delve much into this domain to actually get up and going in the machine learning space. As I mentioned before, software libraries largely abstract away the mathematical algorithms for you. So why am I including it in here? Simply put, if you’re in this game for the long haul, understanding the behind-the-scenes math will make you a better engineer in knowing the most ideal algorithm for a specific scenario. Like the software engineering domain, this math domain has no hard dependencies on the other three domains.
Phew, that was a long introduction! Still, I hope it gives you a general sense of what you can expect and how you might be able to chunk down your own development journey more easily. Trust me, it can seem very daunting in the beginning, so the more manageable the parts that sum up to the whole, the better. As the old saying goes, you eat an elephant by taking one bite at a time. (Umm… but let’s not eat any elephants. My baby girls would not like that.)
One final, FINAL note before jumping into this list of courses: I am not being sponsored in any sort of manner to promote any of these platforms. I appreciate them all for various reasons, and I hope you see that each is given a fair critique for both its positive and negative attributes along the way.
In the words of comedian Pete Holmes: Let’s. Get. Into it!
Domain 1: Data Engineering
Okay, let’s jump into our first domain! I’m going to say that the path I’ve set forth is going to be a little linear. There will be some things you might be able to do out of order, but if you’re not sure what those are, I’d recommend sticking to the order in which I’ve laid things out. Wasting no more time, here are my recommendations.
- Introduction to Python Programming (Platform: Udacity | Cost: Free): The name pretty much tells you everything you need to know. This relatively short course will help jump you into the world of Python: a programming language you will become VERY familiar with in almost every subsequent course.
- SQL for Data Analysis (Platform: Udacity | Cost: Free): I must say, of all the free courses I’ve ever taken on Udacity, this one is easily the best. Just like the previous one introduces you to Python, this one introduces you to SQL from the ground up. You may not use SQL as much as you’ll use Python, but it’s certainly important to know in the data engineering world.
- Data Analysis with R (Platform: Udacity | Cost: Free): Depending on the environment you end up working in, you may find people leveraging the R language over using Python. Admittedly, most of the resources mentioned in the rest of this post lean more toward leveraging Python, but I still think it’s good to get a basic education in R.
- Data Analyst Nanodegree (Platform: Udacity | Cost: $899 for 4 months access): And so, we come to our first non-free course. Now, I personally am wrapping up my sixth nanodegree as I publish this, and I must say… the data analyst nanodegree was easily the one that brought me the most value in a single course. They run the full gamut with what you’ll learn here: Numpy, Pandas, data visualization with Matplotlib, API data extraction, Beautiful Soup, Regex, data cleaning best practices… I’m pretty sure I’m missing others, too! Being one of their more mature nanodegrees, the content is structured really, really well. If you choose to do only one nanodegree, make this one your choice.
- Data Engineer Nanodegree (Platform: Udacity | Cost: $1099 for 5 months access): Where the data analyst nanodegree focused on working in and amongst the data itself, the data engineer nanodegree focuses more on the proper storage, structure, and movement of data. It covers things like relational and non-relational schema design, how to leverage big data processing with Spark, how to automate workflows with Apache Airflow, and more. Now here’s my caveat… when I first took this in April 2019, it was still sort of in beta, and it was clear a lot of kinks still needed to be ironed out. I’m hoping they’ve updated it by now because there really is a lot of great content here. (But given that data manipulation is more directly important for machine learning, I still place stronger importance on the data analyst nanodegree.)
- AWS Certified Big Data Specialty (Platform: Udemy / WhizLabs | Cost: ~$60 total): This is naturally only appropriate for people interested in leveraging AWS for data stuff, but I figure that includes a lot of you reading this post. Now I know, this designation is actually being deprecated in March 2020, but it’s not exactly going away. Two new specialty exams (Data Analytics Specialty and Database Specialty) are forthcoming, and I bet much of this “AWS Big Data” content will carry over into those new exams. For video learning, I highly recommend Sundog Education’s Udemy course. As much as I love A Cloud Guru, I just think this particular course is better suited for this subject. And because I like to study with practice exams, I suppose I recommend WhizLabs. Honestly… I’ve seen better practice exams for other AWS certifications, but… they’re kind of the best of the lot for the Big Data Specialty. *shrugs*
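To give a flavor of how the Python, regex, and SQL skills from these courses fit together, here’s a minimal sketch using only Python’s standard library. The table name, column names, and sample records are all made up for illustration; they don’t come from any of the courses above.

```python
import re
import sqlite3

# Hypothetical raw records with the kind of mess real data tends to have.
raw_rows = [
    ("  Alice ", "$1,200.50"),
    ("BOB", "950"),
    ("carol", "N/A"),
]


def clean_amount(text):
    """Strip currency symbols and commas; return None when no number remains."""
    digits = re.sub(r"[^0-9.]", "", text)
    return float(digits) if digits else None


# Clean with Python: tidy up the names and parse the amounts.
cleaned = [(name.strip().title(), clean_amount(amount)) for name, amount in raw_rows]

# Store and query with SQL, using an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)

# Aggregates such as COUNT(column) and AVG skip NULL values.
valid_count, avg_amount = conn.execute(
    "SELECT COUNT(amount), AVG(amount) FROM sales"
).fetchone()
conn.close()

print(cleaned)
print(valid_count, avg_amount)
```

In real projects you’d probably reach for Pandas for the cleaning step, but the idea is the same: get the data into a trustworthy shape first, then store and query it.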
Domain 2: Machine Learning
As I mentioned before, I definitely don’t recommend jumping into ANY course in this domain until you’ve gotten through at least the “data analyst nanodegree” content from the Data Engineering domain. You’ll quickly find out how important data engineering is as a pre-requisite of sorts to enabling machine learning.
This domain is going to be a little odd. It’ll be fairly linear, but I’m going to offer some “choose your own path” options in here. I personally have taken all of these and found them all to be beneficial in specific regards, but I also understand where you might opt to not take all these. I also will call out one resource I do NOT recommend at this time, particularly because I feel like some people might notice its suspicious absence. I’ll touch on that at the end of this section.
- Intro to Machine Learning Nanodegree (Platform: Udacity | Cost: $699 for 3 months access): Here comes our first “choose your own path” option. When I first took this nanodegree, it was actually bundled as “Term 1” in the two-term “Data Scientist” nanodegree. Since then, it has been separated from Term 2 (now just called “Data Scientist”) and given its own nanodegree. I really liked how the content of this nanodegree was structured. It provided simple, concise explanations of many ML algorithms, and it of course provided a lot of hands-on opportunities. But… it also covers a lot of the same ground as our next option, and the next option is a fraction of the price. I still got a lot of value out of this one, so I still felt it was worth mentioning.
- Machine Learning A-Z: Hands-On Python & R in Data Science (Platform: Udemy | Cost: < $20 with sale / coupon code): Okay… so I have a love / hate relationship with courses like this one. What I love: they cover a ton of ground at a very, very low cost. What I hate: they almost cover too much ground. As I mentioned, Udacity’s Intro to Machine Learning nanodegree was really simple and concise. This course goes a little deep. BUT it also covers some ground the Udacity nanodegree doesn’t, like additional algorithms and how to do everything in R. All this at a fraction of the price of the nanodegree. I did the Udacity nanodegree prior to this course, and I personally found a lot of value in both. But if you’re cost conscious, I don’t blame you if you only choose this option.
- Intro to TensorFlow for Deep Learning (Platform: Udacity | Cost: Free): Okay, so we’ve covered our basic ML bases, but you’ll want to ensure that you at least dip your toe in the deep learning pool. Some of these other resources mentioned do touch on deep learning here and there, but just to really make sure you get exposed to it, I’m recommending this resource. While TensorFlow is obviously emphasized and is important on its own, this course also covers general deep learning knowledge. (If you really want to go deeper on the deep learning topic, I’m sure you can find plenty of resources that will take you further than this course.)
- Data Scientist Nanodegree (Platform: Udacity | Cost: $899 for 4 months access): I know this learning path is more inclined toward an MLE role, but I still found this particular nanodegree to cover enough new content that it’s worth recommending. I’m referring to things like how to create and publish a Python package to PyPI, how to build a recommendation model with collaborative filtering, and exercising your skills in a hefty capstone project. So yeah, even for somebody with their heart set on an MLE role, this nanodegree is still pretty relevant.
- AWS Certified Machine Learning Specialty (Platform: A Cloud Guru / Udemy | Cost: ~$50): Again, this is going to be more inclined toward people looking to use AWS. This particular exam exited beta in March 2019, so it’s still relatively new. Now I’ll be honest: I only have taken the A Cloud Guru course but not the Sundog Education one on Udemy. This is because the latter hadn’t yet launched when I took my test in September 2019. But I’m still recommending it for 3 reasons: a) the A Cloud Guru course is okay but could use some extra oomph, b) I really liked Sundog Education’s Big Data Specialty course, and c) I do know people who have taken this exact course and liked it. The test is hard enough and courses cheap enough that I still recommend looking at both!
- WHAT I SUSPICIOUSLY DO NOT RECOMMEND… Machine Learning Engineer Nanodegree (Platform: Udacity | Cost: $699 for 3 months access): Dun dun dun!!! I know… scandalous! Now why would the guy creating an MLE plan and singing the praises of other Udacity nanodegrees NOT include this particular nanodegree?? It’s got “Machine Learning Engineer” right in the title! The reality is, this nanodegree was updated semi-recently to focus primarily on AWS SageMaker, and it’s done so in a very limited capacity. (Like… it doesn’t even cover custom containers.) It’s not a bad course, but it’s very misleading to call this nanodegree “Machine Learning Engineer” given how narrow its new scope is. I guess it used to cover more ground before its update, but in its current form, I simply cannot recommend this nanodegree for its price. Sorry, Udacity! Can we still be friends…?
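For a taste of what the courses in this domain build toward, here’s a rough sketch of training a model with Scikit-Learn. The dataset (the small iris sample that ships with the library) and the choice of logistic regression are my own picks for illustration, not something pulled from any particular course. It assumes you have Scikit-Learn installed.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small, already-clean dataset that ships with Scikit-Learn.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows to check the model on data it hasn't seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fitting the model is a single call; the library handles the optimization.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# score() reports the fraction of held-out rows classified correctly.
accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.2f}")
```

Notice how little of this is “machine learning” code. The model itself is two lines, which is exactly why the data work and the theory behind picking the right algorithm matter so much more.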
Domain 3: Software Engineering
Alright friends… this area was a tough cookie to crack. I considered splitting this domain a couple different ways, but every time I did, it always ended up just not making any sense. At the end of the day, all these really do roll up under the general heading of “Software Engineering.” But truthfully, many of these things are decoupled from one another, as far as dependencies go. You can largely get by with taking these out of order, and as I mentioned toward the top of this post, you can do almost all of these in tandem with the other three domains. But hey, if you really have no idea in which order to go, I recommend sticking to the order I’ve laid things out.
- Command Line Basics (Platform: Udacity | Cost: Free): Whether using Terminal on a Mac or Git Bash on Windows, you will need to become intimately familiar with how to use a command line for an MLE role. Fortunately, it’s actually a lot easier than you’d think. I personally used to watch TV shows like Mr. Robot and get frozen from confusion any time I saw a command line on screen, but you’ll quickly find it’s not nearly as daunting as it might seem. This free course is very helpful to get you up and going with the basics, and you can easily knock it out in an hour or so.
- Version Control with Git (Platform: Udacity | Cost: Free): Chances are high that you will end up collaborating with others on a software development project, so Git is absolutely necessary to learn. Heck, even if you don’t end up working with anybody, you’ll still want to learn Git to publish your portfolio to GitHub! This free course does a pretty good job at introducing you to the basics of Git, including repository initialization, code commits, branching, and branch merging.
- GitLab CI/CD (Platform: Udemy | Cost: < $20 with sale / coupon code): Okay, so some people are understandably going to argue with me on this one. I think everybody would agree learning CI/CD is important, but I’m not sure people would have me point you to GitLab to learn this. They may instead point you to a course with Jenkins. Still, I’m including this for two reasons: a) it’s growing in popularity amongst many large companies, and b) you’re going to get exposed to a second, different method for CI/CD in the containerization course. Since I have personal experience with this course, I’d still recommend it even for general CI/CD learners since it does a pretty good job at teaching these concepts even at a non-GitLab level.
- Designing a RESTful API (Platform: Udacity | Cost: Free): Confession… I kind of haven’t taken this course…? What I mean is that most of Udacity’s nanodegrees come with free bonus content, so I actually learned the content of what this course teaches from there. That bonus content is generally based on their free content, so I think this is the right free course…? The description seems to indicate so. Anyway, knowing how to build APIs is important, especially when building them using Flask within Python. This course says it covers Flask, so I’m trusting it’ll teach you the same stuff that I learned!
- Docker & Kubernetes: The Complete Guide (Platform: Udemy | Cost: < $20 with sale / coupon code): One of the biggest topics currently gaining traction in the software engineering world is this idea of decoupling monolithic software into containerized microservices, and of course, the most popular tools to help do this are Docker and Kubernetes. With over 21 hours of video content, this course is indeed long, but it really is worth going through the whole thing. The instructor does a great job at explaining everything both theoretically and with hands-on labs. And as I mentioned above, you will get exposed to CI/CD practices again here with GitHub and Travis CI.
- AWS Certified Solutions Architect - Associate (Platform: A Cloud Guru / Udemy | Cost: ~$50 total): As you might recall, I already mentioned it might be worth your time to pursue the respective AWS Big Data and Machine Learning Specialty designations. What I didn’t mention is that you should really begin here (or even the lower Cloud Practitioner) when jumping into learning about the cloud. This particular designation gives you a flavor of cloud engineering on a more general level, which I think is important for you to understand if you’re going to be doing machine learning engineer work on AWS. For video courses, I easily recommend A Cloud Guru, and for reinforcing practice tests, I recommend this pack of practice exams by Jon Bonso you can find on Udemy.
- Google Cloud Platform Associate Cloud Engineer (Platform: A Cloud Guru / Udemy | Cost: ~$40): I know we’ve talked about AWS a lot, but I think it’s good to get educated on a multi-cloud level. Not only will you naturally learn a lot about Google Cloud Platform, but they place a very strong emphasis on Kubernetes here, reinforcing what you’ll learn in the other course above. Once again, our friends at A Cloud Guru come to the rescue with a great video course, and I’ve also attached a link to some practice exams.
Domain 4: Math
Well friends, we’re on the home stretch toward the end! Now, here’s the thing about this particular domain… you can theoretically “scrape by” as a machine learning engineer WITHOUT touching this domain at all. The reason is that machine learning algorithms have been abstracted so well that you really don’t need to learn the underlying math to be able to utilize them. Still, I’d argue that if you want to be successful in the long run, you need these skills under your belt. Period. That said, this domain is not dependent on the others at all, and I might suggest that you hold off on this domain until you’ve gotten solid experience in the others. I personally have found it easier to learn some of the things here knowing how they directly tie to concepts in the other domains!
Now, I’m going to focus on the things you’ll need to know to become a solid machine learning engineer, but if you’re like me, I expect that you probably dropped math back when you graduated from high school. If you need to extend backward to refresh yourself on Algebra I or Trigonometry, do it. Khan Academy can provide a lot of help for you in that refresh space.
- Introduction to Descriptive Statistics (Platform: Udacity | Cost: Free): Just like it sounds, this course will give you a nice baseline introduction to descriptive statistics. It also serves well as a precursor to its next sister course…
- Introduction to Inferential Statistics (Platform: Udacity | Cost: Free): After building foundational knowledge on descriptive statistics from the course above, this course will begin to gently ease you into some of the math that powers the ML algorithms you learned about in the machine learning domain.
- LAFF: Linear Algebra - Foundations to Frontiers (Platform: edX | Cost: Free): Linear algebra is pivotal to understand as it is the engine behind things like recommendation systems using collaborative filtering. Some things you’ll want to keep in mind if you take this course: first, they open/close on the same cadence as a regular university semester. It’s still self-paced, but it’s not like you can start whenever you like unfortunately. And second, the course emphasizes learning via this software called MATLAB. There’s nothing wrong with MATLAB, but what they don’t really tell you is that they have an alternative set of the exact same exercises done in Jupyter Notebooks with Python. Given that you’ll be working more with these latter tools as an MLE, I’d recommend doing the exercises with them instead of MATLAB.
- Differential Calculus (Platform: Khan Academy | Cost: Free): Ah, yes, the magic of a derivative. I don’t have much to say other than that this is a good course. Just be sure to freshen up on prerequisite math if you need to. Khan Academy is probably the best math platform on the Internet, and all their material is free.
- Integral Calculus (Platform: Khan Academy | Cost: Free): Same basic idea here as the one above. Once you start to work your way through these courses, you’ll begin to see how calculus applies to machine learning via things like gradient descent. In that sense, I personally think it’s more valuable to go through the machine learning domain and learn the theory behind the algorithms first. It’ll just help solidify your knowledge on this math working “behind the scenes” when you finally come to learn it.
- Multivariable Calculus (Platform: Khan Academy | Cost: Free): Okay, tiny confession… at the time of this post’s publication, I actually haven’t taken this particular course yet. I’m still recommending it as it came recommended to me by data scientist friends, and my general experience with Khan Academy has been awesome. One last random note: using the Khan Academy iPad app with an Apple Pencil is a phenomenal experience. They’ve enabled it so you can pull up a writing interface for any practice problem you work on. No fussing around with paper!
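Since gradient descent came up as the place where calculus meets machine learning, here’s a bare-bones sketch of the idea in plain Python. The function being minimized, the learning rate, and the step count are all arbitrary choices I made for illustration; real ML libraries do this over many parameters at once, but the core loop looks a lot like this.

```python
def f(x):
    """A simple bowl-shaped function with its minimum at x = 3."""
    return (x - 3) ** 2


def f_prime(x):
    """Derivative of f, worked out by hand: d/dx (x - 3)^2 = 2(x - 3)."""
    return 2 * (x - 3)


def gradient_descent(start, learning_rate=0.1, steps=100):
    """Repeatedly step against the slope to move toward a minimum."""
    x = start
    for _ in range(steps):
        x -= learning_rate * f_prime(x)
    return x


x_min = gradient_descent(start=0.0)
print(x_min, f(x_min))
```

The derivative tells you which way is “downhill” at any point, and the learning rate controls how big a step you take. That’s the piece of differential calculus doing the work behind the scenes when a model trains.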
Well friends, we’ve finally come to the end of this post! It’s been a doozy, but I hope it puts you on a path toward success as you look to build your machine learning engineer skills. As appreciative as I am for the mentors that helped me ultimately form this list, it would have been nice if I could have had this exact list packaged for me like so when I was starting off.
One final thing to keep in mind: I wouldn’t say this list is the “end all, be all.” What I mean is that it will certainly give you all the skills to jump into a machine learning engineer role, but if you don’t continuously learn in specific, nuanced categories beyond this guide, you probably won’t have long term success in your role. Let this guide serve as a solid starting point for your career, and I’m sure it’ll become evident to you how you want to draw out your specialty over time.
Thanks for checking out this post! Best wishes to you all in your learning journey.