Data Engineering Direction

Hello everyone and welcome to my newsletter where I discuss real-world skills needed for the top data jobs. 👏
This week I’m sharing the core steps you can take to start your data engineering journey (provided you’re already in a data role).
Not a subscriber? Join the informed.
Data Engineering: Defined 👀
DBA: Skip it 🫢
Growth: Global data 🚀
Data: Two types 🫡
ML: Structured data 📈
Pipelines: Data kind 👈
🎯 You are my target audience if…
You are a data analyst and want to move to a data engineering role.
You are considering a career in data engineering and want to learn the path you’ll need to take to reach your goal.
The advice in this post is not for non-technical roles. If you’re a business analyst or a QA professional, you are NOT IN A TECHNICAL ROLE and are starting from scratch. 😢 That means you’ll need to focus on a more entry-level position first, like a data analyst or help desk role.
Additionally, this post does not cover the skills you’ll need to be a data analyst; those are in addition to the ones I’ll be discussing here. The top skill for anyone in any of these roles is SQL. If you don’t have 2-3 years of heavy SQL on your resume, you will NEVER make it in data engineering.
The data engineer has been the top job in information technology for over a decade, according to Google, and it’s still the top job today. 🚀
In the world of data there are two top roles.
Data Engineer
Database Administrator
A data engineer is a professional who designs, builds, and maintains the infrastructure that allows organizations to collect, manage, and transform raw data into usable information for analysis by machine learning engineers, data analysts, and other business users.
Think of them as the plumbers and architects of the data world. They build and maintain the pipelines that transport and transform data from various sources into a structured and accessible format.
Why would you choose the data engineer over the database administrator? Here are a few reasons.
Higher pay
Lower barrier to entry
Less visible
Easier skillset
Greater demand
Any one of these alone would move my needle toward the data engineering role. All of them together and it’s a no-brainer. 🧠

Why is there such a great demand for data engineers? I think the bar graph below does a better job of explaining why than anything I could write. The Y axis is in zettabytes. 😳
We are collecting a staggering amount of data each year, and most of it remains unexamined, never viewed by a human. It’s just sitting there. Published estimates put the share actually being used at around 13%. Imagine the jobs created if we analyzed 26%.
Why is the data just sitting there? 😕 Because there aren’t enough data engineers with the technical skills needed to build scalable solutions to house the data so that it can be analyzed.

[Figure: Global Data Accumulation by year, in zettabytes]
Data Basics. In the real world there are two core classes of data: structured data and unstructured data.
Structured Data: Information that is arranged in columns and rows is structured data. Think of data in an Excel spreadsheet or a table in a relational database. Structured data has a predefined format, making it easy to organize, search, and analyze. It fits neatly into rows and columns with clear labels (like in a table).
Unstructured Data: Think of it like a collection of different kinds of documents, emails, or social media posts. Unstructured data does not have a predefined format or organization. It's more like free-form information that doesn't fit neatly into rows and columns.
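If that feels abstract, here’s a minimal sketch in Python (the names and values are made up) showing the same order facts both ways:

```python
import pandas as pd

# Structured: the facts in labeled rows and columns.
orders = pd.DataFrame({
    "order_id": [1001, 1002],
    "customer": ["Acme", "Globex"],
    "total": [250.00, 99.99],
})
print(orders)  # easy to filter, join, and aggregate

# Unstructured: the same facts buried in free-form text.
email = (
    "Hi team, Acme just placed order 1001 for $250.00, "
    "and Globex followed up with order 1002 for $99.99. Thanks!"
)
# No predefined schema; a program has to parse meaning out of this first.
```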
Based on recent industry reports and studies, it’s estimated that around 80% to 90% of the data we are collecting globally is unstructured. 😯 Good news for data engineers. 🎉 Guess who stores and organizes that data? A special type of data engineer, called a machine learning operations engineer, builds pipelines that clean the data for modeling.
From an analytical standpoint, we humans need our data structured. Here’s an example from machine learning: think of the array. Most machine learning models only accept their data in the form of an array; it’s the only object they understand. It’s the job of the machine learning engineer, with the help of the data engineer, to massage that data into an array.
This is often hard for newcomers to the field to understand. 😳 Algorithms that model unstructured data need their data to be structured. Think of computer vision or large language models. All the data they accept prior to modeling is transformed into an array.
Most Large Language Models (LLMs) utilize vectors to represent text and other data. These vectors are essentially arrays of numbers, where each number represents a specific feature or aspect of the input.
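Here’s a small, self-contained sketch of that idea using scikit-learn’s CountVectorizer. The sentences are made up, but the point is general: before any model sees text, the text becomes a numeric array.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "data engineers build pipelines",
    "pipelines move data",
]

# Turn free-form text into a numeric array (a document-term matrix).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the columns of the array
print(X.toarray())  # rows = documents, values = word counts
```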

Time for some trench talk. I’ve worked on three contracts over the last decade that were solely categorized as data engineering roles. On all three, I spent most of my time building and babysitting data pipelines. These things break. 😖 There are lots of failure points and conditions you need to be aware of and correct for.
There are two kinds of pipelines. Other flavors are derived from these, but these two are the root of all the others.
ETL: ETL is an acronym that stands for extract, transform and load. Data is extracted from source systems, transformed into a desired format or structure (cleaning, aggregating, filtering, etc.) in a staging area, and then loaded into a target data store (like a data warehouse).
Streaming: These pipelines process data continuously as it is generated or arrives. Data is processed in small micro-batches or individually with very low latency.
While it’s true there are jobs for those who solely focus on ETL, that architecture is slowly dying. Most companies are opting for streaming data. Ask any business professional when they need their historical data and they all say the same thing: yesterday.
HIGH-LEVEL EXAMPLES
ETL: A company’s core application saves data into three different databases belonging to three different vendors: one is SQL Server, one is an Oracle database, and one is MySQL. Oracle and MySQL are in AWS; SQL Server is on premises.
You need to move the data the business needs for analytics into Snowflake for reporting. These reports are only run at the end of the month. There’s the key. 👀 That means you need one ETL job to load Snowflake, and it needs to run as close to the report date as possible without causing any production issues.
You might think that sounds easy. Nope. All three of those relational databases are production systems, and your pipeline job can’t bring them down or stress them in any way that slows day-to-day operations. That means you’ll need to test each connector and craft an email to the stakeholders assuring them you won’t put any stress on these servers during the execution of your job. The job will also need to run after hours; no sane company allows ETL jobs during core application hours.
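To make the shape of the job concrete, here’s a hedged sketch of what that month-end load might look like in Python. Every connection string, credential, table name, and query here is a placeholder, and a real job would tailor the extract query to each database’s dialect; this only shows the extract-transform-load skeleton.

```python
import pandas as pd
import sqlalchemy
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

# Hypothetical read-only connection strings; every credential is a placeholder.
SOURCES = {
    "sqlserver": "mssql+pyodbc://reader:***@onprem-sql/sales?driver=ODBC+Driver+17+for+SQL+Server",
    "oracle": "oracle+oracledb://reader:***@aws-oracle:1521/?service_name=ORDERS",
    "mysql": "mysql+pymysql://reader:***@aws-mysql/sales",
}

def extract(query: str) -> pd.DataFrame:
    """Pull only the columns the business needs from each source, read-only."""
    # A real job would tailor the query to each database's dialect and schema.
    frames = [
        pd.read_sql(query, sqlalchemy.create_engine(url))
        for url in SOURCES.values()
    ]
    return pd.concat(frames, ignore_index=True)

def load(df: pd.DataFrame) -> None:
    """Bulk-load the combined frame into Snowflake for month-end reporting."""
    conn = snowflake.connector.connect(
        account="my_account", user="etl_user", password="***",
        warehouse="REPORTING_WH", database="ANALYTICS", schema="MONTHLY",
    )
    try:
        write_pandas(conn, df, "MONTH_END_SALES", auto_create_table=True)
    finally:
        conn.close()

if __name__ == "__main__":
    # Run after hours, as close to the report date as production allows.
    orders = extract("SELECT order_id, customer_id, total FROM orders")
    load(orders.drop_duplicates("order_id"))  # a token transform step
```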
STREAMING: The company has an order processing system and it’s fragile. They are having a hard time viewing reports after the orders are taken. The production database simply can’t handle the load during peak hours, and you’ve been tasked with fixing it. You can’t touch the production server. You’re not a DBA, and even if you were, the DBA in charge of this server isn’t going to give you anything other than read access anyhow.
So your solution is to offload the orders into a data warehouse for order processing. Once in the data warehouse, the orders can be prepared for shipping. You write a query that gathers only the data you need for shipping, then build a pipeline that moves that data over continuously.
You can’t use ETL because the data needs to be there minutes after an order is taken, and ETL jobs often take hours to run and are resource intensive. You’ll use a Snowpipe to move the data incrementally into Snowflake. The lag is under a minute: that’s how long it takes the streaming job to update the orders table in the data warehouse. For this company, it’s a great streaming solution.
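Here’s a minimal sketch of creating such a pipe from Python. The account, stage, table, and pipe names are hypothetical, and AUTO_INGEST additionally assumes cloud event notifications are configured on the stage.

```python
import snowflake.connector

# Placeholder credentials; the pipe, stage, and table names are hypothetical.
conn = snowflake.connector.connect(
    account="my_account", user="pipeline_user", password="***",
    warehouse="LOAD_WH", database="WAREHOUSE", schema="ORDERS",
)

# A Snowpipe watches a stage and copies new files in continuously, so fresh
# orders land in the warehouse within about a minute of being staged.
conn.cursor().execute("""
    CREATE OR REPLACE PIPE shipping_orders_pipe
      AUTO_INGEST = TRUE
    AS
      COPY INTO shipping_orders
      FROM @orders_stage
      FILE_FORMAT = (TYPE = 'JSON')
""")
conn.close()
```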

Now that we know what a data engineer does, let’s learn about the different niches. Before we dive into the niche nuances, let’s discuss tech stacks.
In the world of IT, we often refer to tech stacks as ecosystems. A tech stack is the combination of programming languages, frameworks, libraries, databases, servers, UI/UX tools, and other technologies that a company or a development team uses to build and run an application or service.
I’ve spent my life working with the Microsoft ecosystem. Here’s a list of items associated with this stack:
C# - Created by Microsoft
ASP.NET - Backend framework from Microsoft
SQL Server - Microsoft’s relational database
Windows Server - Microsoft’s server operating system
Windows Desktop - Microsoft’s desktop operating system
Azure - Microsoft’s cloud
Power BI - Microsoft’s tool for creating dashboards and KPIs
If you work with the Microsoft ecosystem, you never need to leave it in order to do your job. There are no other tech stacks like it. If you’re a data professional, the Microsoft tech stack has the best data tools and is the most mature.
Here’s an example of a job within the Microsoft ecosystem. This job’s focus is mostly on Azure. The only tools that aren’t Microsoft products are PySpark and Python.

If you’re not going to specialize in Azure, that’s fine. The first thing you’ll need to do is pick a cloud. The majority of companies on earth have a cloud presence. The big three cloud vendors are:
Azure: Microsoft’s Cloud
AWS: Amazon’s Cloud
GCP: Google’s Cloud
Azure and AWS are way ahead of GCP. I’ve worked on all three clouds in the real world, and the worst was AWS. Their data movement tools were so bad I quit a contract because of them. I was using DMS, or Database Migration Service. 💩
Now, many of you will conclude that my bias against AWS is unwarranted because I work on Microsoft’s tech stack. Well, plenty of people who don’t like or work with Microsoft products agree with me.

I’m not trying to sell you on a cloud vendor. If you’re just starting out, do your research on the tools you’ll be using. Your life as a data engineer will be complicated enough; choosing the wrong tools could make it miserable.

Focus on the big three: a cloud vendor, a pipelining tool, and a data warehouse. Oftentimes, when choosing a cloud vendor, you’ll need to learn their pipeline tool and possibly their data warehouse, even if it’s not the primary one you’ll be using. 😕
For example, I work with Snowflake on Azure often. Even though I’m using Snowflake, or a third-party vendor like Fivetran, for my pipelines, I still need to know how Azure Data Factory works and be able to create pipelines using that tool if needed.
Another aspect of Azure, or more specifically Fabric, that I really like is its move toward simplicity: zero-code tools. Much of Fabric requires little to no programming for building pipelines.

Cloud, Pipeline, Data Warehouse
We have plenty of background on data engineering now. Let’s discuss the best approach if you’re thinking about moving into data engineering.
Cloud Vendor: You have three choices here, but there are really only two unless you’re already working on GCP. GCP isn’t going to catch up to Azure and AWS.
Pipeline Tool: Pick a tool associated with the vendor you chose. If it’s Azure, you’ll need to know Azure Data Factory and the new pipeline options in Fabric. If it’s AWS, whatever horrible tools they’re using these days. 😂
Data Warehouse: You’ll need to know the data warehouse of the cloud you’re learning. If you’re going to go the Snowflake route, then you’ll need to know that instead.
Relational Database: Every single data engineer I’ve ever met knew one relational database system really well. You’ll need a few years with a relational database under your belt.
CI/CD: You’ll need to understand this process and be able to explain it during the interview.
Language: I’ve never worked in a data engineering role where I used it; however, Python is on just about every data engineering job listing I’ve looked at. That means you’ll need to know the basics of Python. (See the short sketch after this list.)
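If “the basics of Python” feels vague, the sketch below is roughly the level I mean: read raw data, clean it, and write it out in a warehouse-friendly format. The file names are hypothetical.

```python
import pandas as pd

# Hypothetical file names; the shape of a small everyday pipeline task.
raw = pd.read_csv("orders_raw.csv")

cleaned = (
    raw.dropna(subset=["order_id"])                  # drop rows missing the key
       .assign(total=lambda d: d["total"].round(2))  # normalize a numeric column
       .drop_duplicates("order_id")                  # de-dupe on the business key
)

# Parquet is a columnar format warehouses load well (needs pyarrow installed).
cleaned.to_parquet("orders_clean.parquet")
print(f"wrote {len(cleaned)} rows")
```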
Still not sure? It’s ok. Most won’t be. It’s a big decision, and there’s a shit ton of information you’ll need to ingest and apply. If you begin preparing for this or other top roles like the machine learning engineer and you aren’t overwhelmed, then you don’t understand the assignment yet.
Take your time. There is no hurry. The jobs are there and always will be. There is no saturation point in the data roles. AI won’t be taking away any data roles.
Have a great day. 👏
PS - I do realize this post was verbose. I’ll do better in the future.