The Machine Learning Engineering Playbook

Here is the Approach I'd Recommend

Hello everyone and welcome to my newsletter where I discuss real-world skills needed for the top data jobs. 👏

This week I’m sharing the core steps you can take to start your machine learning engineering journey (provided you are already in a data role).

Not a subscriber? Join the informed. Over 200K people read my content monthly. Thank you. 🎉

  • Defined: Machine Learning Engineer 👀

  • Research: Mostly Useless 😙

  • Pipeline: The Template ⚙️

  • Process: 4 Steps 👣

  • Boards: Job Reqs 📢

  • Python: The gold standard 🐍

  • Skills: The core 🏆

  • Prompt: Nice assist 💁‍♂️

  • FREE: 3 courses 🆓

The machine learning engineer is the top role in all of AI. I’m excluding the elite few researchers who created the models we use today, like Geoffrey Hinton and Tianqi Chen. Hinton is often called the godfather of deep learning, and Chen created XGBoost, one of the most widely used models for machine learning on structured data.

Most research is useless. The same applies to AI. Only a handful of people ever create anything usable. By my estimate the number is around 1%. Yes, that means roughly 99% of all research is useless. This often upsets people, but it shouldn’t. If you want to work in machine learning, you don’t need a PhD, and you don’t need to rise to the pinnacle of research 🙄 to become a highly paid and well-respected machine learning engineer.

A machine learning engineer works the entire end-to-end machine learning pipeline. Most machine learning engineers come from other data roles. Why? Because most machine learning is done on structured data.

Even when the data is not structured, the machine learning engineer still needs to massage it into an array-like structure, because that is the only structure machine learning models understand.

High-Level ML Pipeline

Here’s the machine learning process at a very high level. 

  • Source Data - The data is most often found in structured data stores like relational databases or data warehouses. It must be extracted from those stores, usually with SQL.

  • Cleanse Data - The data is raw and needs to be cleansed. That means applying statistical techniques to it, such as handling missing values and outliers.

  • Model Data - Next, the clean data is fed to the model. This is called fitting. The model is tuned and tweaked until the best one is found.

  • Production - Once the best model is found, it is moved to production to be consumed.
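The four steps above can be sketched in a few short functions. Treat this as a toy under stated assumptions: an in-memory SQLite database stands in for a real warehouse, the `customers` table and its columns are hypothetical, and SciKit-Learn’s built-in gradient booster is used only to keep dependencies small (in practice you might reach for XGBoost). Pickling the model is a crude stand-in for a real deployment.

```python
import pickle
import sqlite3

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split


def source_data():
    """Step 1, Source: pull rows out of a data store with SQL."""
    conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse (assumption)
    conn.execute("CREATE TABLE customers (age REAL, monthly_spend REAL, churned INTEGER)")
    rows = [(20.0 + i, 100.0 - 2.0 * i, int(i >= 20)) for i in range(40)]
    rows[3] = (None, rows[3][1], rows[3][2])     # simulate a missing age
    rows[27] = (rows[27][0], None, rows[27][2])  # simulate a missing spend
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    df = pd.read_sql_query("SELECT age, monthly_spend, churned FROM customers", conn)
    conn.close()
    return df


def cleanse_data(df):
    """Step 2, Cleanse: fill missing numeric values with each column's median."""
    return df.fillna(df.median(numeric_only=True))


def model_data(df):
    """Step 3, Model: fit a gradient booster and score it on held-out rows."""
    X, y = df[["age", "monthly_spend"]], df["churned"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0, stratify=y
    )
    model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
    return model, model.score(X_test, y_test), X_test


def to_production(model):
    """Step 4, Production: serialize the fitted model so another process can load it."""
    return pickle.dumps(model)


clean = cleanse_data(source_data())
model, accuracy, X_test = model_data(clean)
restored = pickle.loads(to_production(model))
print(f"held-out accuracy: {accuracy:.2f}")
print(restored.predict(X_test.head(3)))
```

Notice how little of the code is the model itself. Most of it is sourcing and cleansing.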

Now that we have a high-level understanding of the role, let’s peruse a few job posts. These are live jobs. When you begin looking at roles, keep in mind that you are looking for the technical skills. Soft skills simply aren’t that important. You’ll also often see companies list domain skills, for example x years in healthcare or banking. You can ignore these too. The technical skills are king. 👑

🚦Deep learning models excel on unstructured data problems like text, video, images and audio. Traditional models excel on structured data. For example, if the data was sourced from a relational database and the problem is classification or regression, the top model class is gradient boosters. Classification and regression problems account for around 90% of real-world machine learning.🚦

Job Post from Indeed

After reading over the requirements, often called reqs, I noticed something immediately: the company doesn’t know what it is doing. This is common. Most companies don’t. Notice the box I’ve highlighted at number one. It reads, “deep learning models using large datasets.” 🚷

The top models on structured data problems aren’t deep learning models. I wouldn’t correct them until I was offered the role, but the new hire should start using gradient boosters. They perform better on structured data problems.

Wait, where did they tell you what kind of data they were modeling? See the bullet point with the number 2? The data you will be modeling is housed in Snowflake. It’s the only data store they mention. They also tell you the deep learning framework they are using, PyTorch. There are two leading deep learning frameworks, PyTorch and TensorFlow. If they force you to use PyTorch, calling the model is a single line of code, so you should be fine with that.

Sorry, that was a lot of stuff. 👀 Recall our machine learning pipeline? We just answered the first step. You will be sourcing data housed in Snowflake, a data warehouse that here runs on AWS. (Snowflake can run on AWS, GCP or Azure.) You also know you’ll be modeling structured data problems. Keep this real-world fact in mind: there are no entry-level jobs in machine learning. You need to right-size your expectations.

What skills do you need for this role so far? You’ll need to know the basics of AWS, Amazon’s cloud. You’ll need to know SQL, the language Snowflake speaks. You also know you’ll be using the deep learning framework called PyTorch… until you start and show them gradient boosters are superior. 😂
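Here is a small sketch of what sourcing with SQL looks like from Python. I’m assuming an in-memory SQLite database as a stand-in for Snowflake so the example runs anywhere, and the `orders` table, its columns and its rows are all hypothetical. Against a real warehouse the connection would differ, but the SQL is the part that carries over.

```python
import sqlite3

import pandas as pd


def build_demo_store():
    """Create a tiny in-memory database (a stand-in for a real warehouse)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL, status TEXT)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [
            (1, 10.0, "complete"),
            (1, 15.0, "complete"),
            (2, 7.5, "cancelled"),
            (2, 20.0, "complete"),
            (3, 5.0, "cancelled"),
        ],
    )
    return conn


def load_customer_totals(conn):
    """Filter and aggregate in SQL, then hand the result to Pandas."""
    query = """
        SELECT customer_id,
               SUM(amount) AS total_spend,
               COUNT(*)    AS n_orders
        FROM orders
        WHERE status = 'complete'
        GROUP BY customer_id
        ORDER BY customer_id
    """
    return pd.read_sql_query(query, conn)


conn = build_demo_store()
totals = load_customer_totals(conn)
print(totals)
```

The filtering and aggregation happen in the database, so only the small result set travels into Pandas. Customer 3 has no completed orders and drops out of the result.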

Let’s take a look at the requirements near the number two. The list includes PyTorch, which is the deep learning framework they are using. Next is Python, the gold standard in applied machine learning. You’ll need to know a good deal of Python.

They also list out a few libraries, SciKit-Learn and Pandas. SciKit-Learn is used for all the tools you’ll need, like the various metrics. Pandas is used for working with small datasets. Here’s a list of things you’ll need to know. This list is a fantastic foundation for your career in machine learning. It’s where you should spend 99% of your time learning.

  • SQL - The language used to query the stores where the data lives.

  • Data warehouse - Location where the data is stored. Speaks SQL.

  • Python - The language used for data cleansing and modeling.

  • Libraries - The prepackaged code used for machine learning.

  • Frameworks - The main structures used for deep learning.

  • Cloud - The place where most of the models are built.

Let’s assume you are in a data role right now. Why? Because that’s the only path to a job in machine learning that I know of in the real world. What about a Python programmer? Yes, that works too, as long as you have 3-5 years of heavy SQL on your resume. If you can’t source the data, how are you going to clean and model it? You aren’t.

As a data analyst, you will have heavy SQL, you should be working with a data warehouse and you should have cloud experience. If not, find some way in your current role to gain these skills, or get a new job.

The hard part will be Python and the libraries. You’ll need a solid year of studying and learning Python before you apply to any role as a data analyst trying to make a transition to machine learning. You’ll need the same amount of time with the libraries. The libraries are a lot more involved than most think. You’ll use Pandas and SciKit-Learn more than any framework or any other library so this should be your focus.

It’s too bad there are no valid Python certifications. That would really help. There are no library certs either. So you’ll need a very good command of the Python language and these libraries.

Your focus should be on the foundation. Most interviews start with foundation questions. If you are having a hard time understanding what foundation questions are, do this: head to a prompt and ask for basic interview questions. Here is one for SQL Server.

Here’s one for the Pandas library. These are fantastic. My Amazon interview was heavy Pandas and all three of these questions were asked on initial tech discussions.
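The questions themselves aren’t reproduced in this text, so the sketch below is an assumption about what foundation-level Pandas work tends to look like: filling missing values, merging two tables, grouping and filtering. The `employees` and `departments` tables are made up for the example.

```python
import pandas as pd

# Hypothetical tables, invented for this illustration.
employees = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Cy", "Dee"],
        "dept_id": [1, 1, 2, 2],
        "salary": [100.0, None, 80.0, 130.0],
    }
)
departments = pd.DataFrame({"dept_id": [1, 2], "dept": ["Data", "Eng"]})


def summarize(employees, departments):
    """Fill missing salaries, join department names, then aggregate and filter."""
    # Missing values: replace NaN with the column median.
    filled = employees.assign(salary=employees["salary"].fillna(employees["salary"].median()))
    # Merge: the Pandas equivalent of a SQL LEFT JOIN on dept_id.
    merged = filled.merge(departments, on="dept_id", how="left")
    # Group by: average salary per department.
    avg_salary = merged.groupby("dept")["salary"].mean()
    # Boolean filter: names of everyone earning above 100.
    high_earners = merged.loc[merged["salary"] > 100, "name"].tolist()
    return avg_salary, high_earners


avg_salary, high_earners = summarize(employees, departments)
print(avg_salary)
print(high_earners)
```

If you come from SQL, map each step to what you already know: `merge` is a JOIN, `groupby` is GROUP BY, and the boolean filter is a WHERE clause.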

The only one that may be a little tricky is SciKit-Learn. In the applied space, we don’t use it for modeling; we use it for all the tools it has, like metrics. Just keep this in mind.
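As a quick sketch of the metrics side, here are a few of the scoring tools from `sklearn.metrics` applied to predictions that could have come from any model. The labels and predictions are hand-written for illustration, not the output of a real model.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Hypothetical ground truth and predictions, written by hand for this example.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]


def evaluate(y_true, y_pred):
    """Score a set of binary predictions with common classification metrics."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),    # share of all predictions that are correct
        "precision": precision_score(y_true, y_pred),  # of the predicted 1s, how many are really 1
        "recall": recall_score(y_true, y_pred),        # of the real 1s, how many were found
        "confusion": confusion_matrix(y_true, y_pred).tolist(),  # [[TN, FP], [FN, TP]]
    }


report = evaluate(y_true, y_pred)
print(report)
```

With three true positives, three true negatives, one false positive and one false negative, all three scores come out to 0.75. The functions don’t care which library produced the predictions.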

These responses are not only great for technical interviews, they provide you with an outline for what you need to know and what to study. 👏

Currently, the machine learning engineer is one of the hardest careers to break into. If you aren’t sure whether it’s for you, I have several free YouTube courses where you can whet your appetite. These are completely free, and I promise they have more real-world insight than 99% of all the other courses you’ll ever take.

Thanks for reading and have a great day. 👏
