The Top Data Roles
Real-World Roles for Data Professionals
Welcome everyone.
Like many things in this space, data roles can be very confusing. In this newsletter we are going to focus on the top data roles in the applied world. Applied means the real world, not academia or college. An important distinction for these roles is that they must live within the IT vertical. A business analyst works with data all day; however, that is NOT a technical role. Here's a list of the top technical roles within the IT vertical.
Data Analyst
Data Engineer
DBA
Machine Learning Engineer
Faker Scientist
SQL Developer
ETL Developer
Most of these roles have been around for a long time. The newest are the data engineer and the faker scientist. I use the words faker scientist because this role is disappearing and is responsible for most of the failure in the artificial intelligence space.
DATA ANALYST: The dashboard and KPI King
This one is confusing because many companies have decided to tweak the role and add skills to it. This has caused a lot of confusion. The worst offender is Facebook, or Meta... or whatever. They added a list of skills that were never part of the role. You might be thinking, well, roles change over time and adapt to changes in technology. My answer is simple. No, they don't.
Data analysts create dashboards and KPIs. That is what they do. They don't create machine learning models or data warehouses.
Microsoft helped create this role decades ago, so they get to help define it. Let's look at their definition.
"A data analyst enables businesses to maximize the value of their data assets through visualization and reporting tools. They're also responsible for profiling, cleaning, and transforming data. Their responsibilities also include designing and building scalable and effective data models, and enabling and implementing the advanced analytics capabilities into reports for analysis. A data analyst works with the pertinent stakeholders to identify appropriate and necessary data and reporting requirements, and then they're tasked with turning raw data into relevant and meaningful insights." - Microsoft
If you sift through the fluff, what do you see? What word was used the most in their definition? Right, reporting. Data analysts create interfaces that are reporting dashboards. This is by far the best definition for a data analyst.
The core skills you'll need for this role are SQL, an interface tool like Power BI, and cloud skills. The secondary skills you'll need are data warehouse basics and Excel. This is the only entry-level data role I know about.
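To make the "SQL plus a dashboard" idea concrete, here is a minimal sketch of the kind of KPI query that might sit behind a report tile. It uses Python's built-in sqlite3 module so it runs anywhere. The `orders` table, its columns, and the function names are hypothetical, chosen only for illustration.

```python
import sqlite3

def build_orders_db():
    """Create a tiny in-memory table standing in for a real sales database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_date TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [
        ("2024-01-05", 120.0),
        ("2024-01-20", 80.0),
        ("2024-02-03", 200.0),
    ])
    return conn

def monthly_revenue_kpi(conn):
    """Return one (month, total_revenue, order_count) row per month."""
    query = """
        SELECT substr(order_date, 1, 7) AS month,
               ROUND(SUM(amount), 2)    AS total_revenue,
               COUNT(*)                 AS order_count
        FROM orders
        GROUP BY month
        ORDER BY month
    """
    return conn.execute(query).fetchall()

# Each row is what a dashboard tool would plot as one bar or data point.
for row in monthly_revenue_kpi(build_orders_db()):
    print(row)
```

In practice a tool like Power BI would issue a query along these lines against the real database and render the result; the Python wrapper here only exists to make the sketch runnable.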
PROS:
You create something tangible
Very visible role
Great job security
Entry level
Pay has risen dramatically
CONS:
A ton of meetings
Work with the business types
Very visible. Mistakes are amplified
Becoming very competitive
Low coin
DATA ENGINEER: The company's data stewards
This job, along with the DBA, is one of the top data roles in any organization. Data engineers are responsible for every facet of the company's data.
Here's a very good definition for a data engineer.
Data engineers are responsible for designing, maintaining, and optimizing data infrastructure for data collection, management, transformation and access.
Data engineers are responsible for building and maintaining data infrastructure for optimal extraction, transformation, and loading of data from a wide variety of sources such as Azure, AWS, GCP and on-prem. Here are a few core job requirements.
Always ensuring data accessibility and implementing company data policies with respect to data privacy and confidentiality.
Improving data systems reliability, speed, and performance.
Creating optimal data warehouses, pipelines, and reporting systems to solve business problems.
Data engineers spend a lot of time moving data around the organization. They create processes called CI/CD data pipelines. A CI/CD pipeline (Continuous Integration/Continuous Delivery) for data engineering automates the process of building, testing, and deploying data pipelines, ensuring consistent and reliable data processing and delivery.
Be careful here. You'll also see this process specific to software development. It's the same idea, different things being moved. We are moving data, not code.
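The "testing" part of a CI/CD data pipeline can be sketched as a transformation step plus an automated check that runs before anything is deployed. This is a minimal illustration, not any particular vendor's tooling. The function names and record fields below are assumptions made up for the example.

```python
def clean_orders(rows):
    """Transformation step: drop rows with no amount, normalize the rest."""
    cleaned = []
    for row in rows:
        if row.get("amount") is None:
            continue  # incomplete record, skip it
        cleaned.append({
            "order_id": row["order_id"],
            "amount": float(row["amount"]),
            "country": row["country"].strip().upper(),
        })
    return cleaned

def check_clean_orders():
    """The kind of check a CI job might run on every change to the pipeline."""
    sample = [
        {"order_id": 1, "amount": "19.99", "country": " us "},
        {"order_id": 2, "amount": None, "country": "de"},
    ]
    result = clean_orders(sample)
    assert result == [{"order_id": 1, "amount": 19.99, "country": "US"}]
    return True

# If the check fails, the pipeline change is not deployed.
print("pipeline checks passed:", check_clean_orders())
```

The design point is that the check runs automatically on every change, so a broken transformation is caught before it touches production data.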
There are a ton of niches within data engineering. Some data engineers will focus on Azure. Others on AWS and GCP. Still others will only focus on Snowflake and Databricks.
PROS:
Top of the food chain
Only visible to those in IT
Great job security forever
Fewer meetings
Top coin
CONS:
Not entry level
Very technical
Stressful
Fewer meetings
DBA: The Structured Data Goats
You are equal to the data engineer. You only work with structured data. The most elite DBAs will only work with one vendor. For example, you might be a MySQL DBA or a SQL Server DBA. Let's define the job.
A database administrator, or DBA, is responsible for maintaining, securing, and operating databases and also ensures that data is correctly stored and retrieved.
In addition, DBAs often work with developers to design and implement new features and troubleshoot any issues. A DBA must have a strong understanding of both technical and business needs.
It's not a role I'd recommend. The data engineer is newer and gets more attention. That means more money. DBAs need to know a lot and aren't paid very well. These data stores are well designed, making it easy for novices to run them adequately.
For example, you could be a Snowflake data warehouse professional who only knows Snowflake and basic SQL and easily make 250K. There are very few DBAs making that kind of scratch. The data warehousing professional is far less technical than the DBA.
PROS:
Top of the food chain
Highly respected in IT
Great job security forever
Very focused
CONS:
Not entry level
Very, very technical
Stressful
Low coin
Machine Learning Engineer: The predictive analytics professional
The top job in AI, outside a few hundred AI researchers, is the machine learning engineer. If we exclude Hinton, Goodfellow, Karpathy, LeCun, Ng, Chen... and a few others, the MLE is the top role.
Machine learning engineers work the entire machine learning pipeline. They source the data, clean it, build the models and then put that model in production.

The Core ML Pipeline
Here's a high-level look at that pipeline.
Source raw data - Data can come from anywhere but right now, most of it comes from relational databases or data warehouses. You'll author the queries to pull all the data you need into a single dataset.
Clean the data - Data is dirty. You'll need to clean it. You'll need to convert every non-numeric value to numbers. You'll apply statistical techniques to your data. Shit in, shit out. This is where the model succeeds or fails.
Build the model - This is where the model will be trained. That simply means passing your cleaned data to the model, where it will look for patterns.
Make prediction - The best model wins. You feed the data to your model, you tweak and tune the knobs and buttons on your model, called hyperparameters, and then you put the best model in production.
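The four steps above can be sketched end to end with nothing but the Python standard library. This is a toy illustration under loud assumptions: the tables (`customers`, `usage_stats`), the columns, the churn framing, and the hand-rolled k-nearest-neighbors "model" are all hypothetical stand-ins. A real project would use a proper ML library and far more data.

```python
import sqlite3

def build_demo_db():
    """Hypothetical source system: two small relational tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, plan TEXT, churned INTEGER)")
    conn.execute("CREATE TABLE usage_stats (customer_id INTEGER, monthly_hours REAL)")
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", [
        (1, "basic", 1), (2, "basic", 1), (3, "pro", 0),
        (4, "pro", 0), (5, "basic", 1), (6, "pro", 0),
    ])
    conn.executemany("INSERT INTO usage_stats VALUES (?, ?)", [
        (1, 2.0), (2, 3.0), (3, 40.0), (4, 35.0), (5, None), (6, 38.0),
    ])
    return conn

def source_data(conn):
    """Step 1: one query that pulls everything into a single dataset."""
    return conn.execute(
        "SELECT c.plan, u.monthly_hours, c.churned "
        "FROM customers c JOIN usage_stats u ON u.customer_id = c.id"
    ).fetchall()

def clean_data(rows):
    """Step 2: drop incomplete rows and turn text categories into numbers."""
    plan_codes = {"basic": 0.0, "pro": 1.0}
    return [
        ((plan_codes[plan], float(hours)), int(churned))
        for plan, hours, churned in rows
        if plan in plan_codes and hours is not None
    ]

def knn_predict(train, features, k):
    """Step 3 stand-in model: majority vote among the k nearest training rows."""
    by_distance = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], features)),
    )
    votes = [label for _, label in by_distance[:k]]
    return int(sum(votes) * 2 > len(votes))

def accuracy(train, holdout, k):
    """Fraction of held-out rows the model labels correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in holdout)
    return hits / len(holdout)

def pick_best_k(train, holdout, candidates=(1, 3)):
    """Step 4: try each hyperparameter value and keep the best scorer."""
    return max(candidates, key=lambda k: accuracy(train, holdout, k))

dataset = clean_data(source_data(build_demo_db()))
print(len(dataset), "clean rows; prediction for a light basic user:",
      knn_predict(dataset, (0.0, 1.0), k=1))
```

Note how little of the code is the model itself. Most of it is sourcing and cleaning, which matches the point that this is where projects succeed or fail.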
PROS:
Top of the food chain in prediction roles
Highly respected everywhere
Great job security forever
Very visible
Great coin
CONS:
Not entry level
Very technical
Stressful
Tons of meetings
Highly visible
Faker Scientist: The academics with lots of theory and zero technical skills
The faker scientist is supposed to be able to do all the things a machine learning engineer can, yet most can't make it past the most basic SQL interview. I use the negative title faker scientist instead of data scientist because of what they've done to predictive analytics.
Most faker scientists have advanced degrees. They know all the theory but can't source data to save their lives. Their entire training has been on toy datasets already cleansed and ready for modeling. This is one of the reasons for the massive failure of faker scientists in the real world. Data is dirty and faker scientists have no idea how to clean it. They want someone else to do that for them.
If you haven't read my post specific to the lies in AI, you might not know the failure rate in the real world on faker science projects is 96%.
Only 4% of data science projects end in the deployment of a production model. Yeah, you read that right.
Imagine being a faker scientist in the real world and bragging about how great you're doing. This is the single greatest failure in the history of IT.
PROS:
You look good, dumb people are easily fooled
Highly respected everywhere, except for the IT people
Not technical
Low stress, you just sit in meetings
Great coin
CONS:
Role is dying
Ton of meetings
Highly visible
Zero IT respect
SQL Developer: The SQL authoring experts
Authoring basic SQL code is easy. A third grader could do it. Writing stored procedures is brutally difficult; very few can do it. A stored procedure is prepared SQL code that you can save, so the code can be reused over and over again.
SQL developers spend most of their time authoring production data access code. The uninformed see them as DBAs. They are not. They don't have DBA skills and most DBAs don't make good SQL developers. It is a unique niche.
Stored procedures can be thousands of lines long... and many are. Even smaller stored procedures can be hard to follow. Think I'm embellishing? Here's a simple common table expression.

Lots of Fun
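For readers who have not met one, the basic shape of a common table expression can be sketched on a much smaller scale. The example runs through Python's built-in sqlite3 module. The `orders` table, its columns, and the 100 threshold are hypothetical, and production CTEs are usually far longer than this.

```python
import sqlite3

# A CTE names an intermediate result (customer_totals) so the main
# query can read from it as if it were a table.
CTE_QUERY = """
WITH customer_totals AS (
    SELECT customer_id, SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer_id
)
SELECT customer_id, total_spent
FROM customer_totals
WHERE total_spent > 100
ORDER BY total_spent DESC
"""

def make_orders_db():
    """Tiny in-memory table standing in for a production database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [
        (1, 60.0), (1, 70.0), (2, 40.0), (3, 150.0),
    ])
    return conn

def top_customers(conn):
    """Run the CTE and return (customer_id, total_spent) rows."""
    return conn.execute(CTE_QUERY).fetchall()

print(top_customers(make_orders_db()))
```

Now imagine a dozen of these chained together, each feeding the next, and the difficulty the section describes becomes easier to picture.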
PROS:
Very few meetings
Highly respected
Very technical
Job security
Solid coin
CONS:
Very difficult
Highly visible
Isolated
Stressful
ETL Developer: Moving data around and around
Most ETL devs come from the DBA or SQL developer roles. ETL is an acronym and it stands for extract, transform and load.
All ETL developers will work with a tool that's designed for moving data. For example, SQL Server comes with a tool called SSIS, or SQL Server Integration Services. Here's what that interface looks like.

SSIS Interface
There's a lot going on with this interface but in reality, it simply moves data from one location to another. For example, let's say I wanted to push all the company's relational databases to a data warehouse. I could do that with this tool.
Wait? What's the difference between these packages and CI/CD pipelines? These packages move data on a scheduled basis. Pipelines often move data continuously. If one simple change is made on the source side, that change is immediately moved to the destination. These packages often move data in bulk.
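A scheduled bulk package boils down to three functions run in order. Here is a minimal sketch of that extract, transform, load shape using Python's built-in sqlite3 module. It is not SSIS, and the table names (`employees`, `dim_employee`), columns, and function names are hypothetical.

```python
import sqlite3

def make_source_db():
    """Stand-in for an operational relational database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, salary REAL)")
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", [
        (1, " ada lovelace ", 120000.0),
        (2, "grace hopper", 90000.0),
    ])
    return conn

def extract(source):
    """Extract: pull the rows in bulk from the source system."""
    return source.execute("SELECT id, name, salary FROM employees").fetchall()

def transform(rows):
    """Transform: tidy names and convert yearly salary to monthly."""
    return [
        (emp_id, name.strip().title(), round(salary / 12, 2))
        for emp_id, name, salary in rows
    ]

def load(warehouse, rows):
    """Load: write into the warehouse table, replacing rows already present."""
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS dim_employee "
        "(id INTEGER PRIMARY KEY, name TEXT, monthly_salary REAL)"
    )
    warehouse.executemany("INSERT OR REPLACE INTO dim_employee VALUES (?, ?, ?)", rows)
    warehouse.commit()

def run_batch(source, warehouse):
    """One scheduled run: move everything in bulk and report the row count."""
    rows = transform(extract(source))
    load(warehouse, rows)
    return len(rows)

warehouse = sqlite3.connect(":memory:")
print("rows moved:", run_batch(make_source_db(), warehouse))
```

A scheduler would call `run_batch` nightly or hourly. Because the load step replaces rows by primary key, running the same batch twice leaves the warehouse in the same state.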
PROS:
Very few meetings
Highly respected
Very technical
Job security
Solid coin
CONS:
Very difficult
Isolated
Stressful
Congrats. What data role is right for you?
Thanks for reading and have a great day.