Demystifying Data Cleansing

Let's Discuss Data Cleansing

Welcome Everyone 👏

What’s inside this newsletter?

  • Introduction: Data is dirty

  • Data Cleansing: Machine Learning

  • Process: Three Core Steps

  • Titanic Dataset: First steps

In the real world, data is dirty. It’s not a little dirty, it’s filthy. Most applied machine learning is data cleansing. A very famous AI researcher, Andrew Ng, has said it’s around 85% of the job for anyone building predictive models in the real world. 👀

As an aside, prior to data cleansing we need to source our data. That’s another article. 🙂 Sourcing means identifying where the data lives now and how you are going to get that data to cleanse it.

In tabular data, there are many different statistical analysis and data visualization techniques you can use to explore your data in order to identify data cleaning operations you may want to perform. There are many types of errors that exist in a dataset, although some of the simplest errors include columns that don’t contain much information and duplicated rows.
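
Here’s a minimal pandas sketch of those two checks, assuming a hypothetical file called my_data.csv: it counts duplicated rows and flags columns that hold only a single value.

```python
import pandas as pd

# Hypothetical file name -- point this at whatever tabular data you have
df = pd.read_csv("my_data.csv")

# Duplicated rows: count them, then drop them
print(f"Duplicate rows: {df.duplicated().sum()}")
df = df.drop_duplicates()

# Columns that carry almost no information: a single unique value
# gives a model nothing to learn from
low_info_cols = [col for col in df.columns if df[col].nunique() <= 1]
print(f"Low-information columns: {low_info_cols}")
df = df.drop(columns=low_info_cols)
```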

❝

Approximately 90% of all models are built on data sourced from structured data stores like databases and data warehouses.

Let’s get started by defining DATA CLEANSING specific to machine learning. Yes, there is another kind of data cleansing and it applies to the work done by data analysts. No, data analysts do NOT build machine learning models. Data analysts also spend a lot of time cleaning their data, but in a very different way.

Here is a very simple definition of data cleansing for machine learning. Data cleaning refers to identifying and correcting errors in the dataset that may negatively impact a predictive model.

❝

 Data cleaning refers to identifying and correcting errors in the dataset that may negatively impact a predictive model.

This is a really important point. All machine learning models are monolingual. They speak one language and that is numbers. I’m telling you that EVERY single machine learning model on earth only understands numbers. 👍
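
To make that concrete, here’s a tiny sketch with made-up values showing one common way to turn a text column into numbers: one-hot encoding with pandas.

```python
import pandas as pd

# Tiny made-up example: a text column a model can't use until it becomes numbers
df = pd.DataFrame({
    "Port": ["S", "C", "S", "Q"],
    "Fare": [7.25, 71.28, 8.05, 7.75],
})

# One-hot encoding turns each category into its own 0/1 column
df_numeric = pd.get_dummies(df, columns=["Port"])
print(df_numeric)
```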

⚠️ We are only talking about the very basics here. Data cleansing is very involved. I created a course that’s around three hours long that details a framework and has real-world examples of data cleansing. It’s not an entry-level course. It requires that you know Python, are well grounded in statistics, and have a solid background in machine learning concepts. While it’s not easy, it’s what you’ll be doing as a machine learning engineer in the real world, so you need to know it.

Data cleansing at a high level includes three steps.

  • Choose Data

  • Preprocess Data

  • Transform Data

STEP ONE: Data Selection

👀 All machine learning models are built on data. So, prior to modeling we need some data. You’ll need to select a subset of data from your dataset. There is always a strong desire to include all the data that is available; however, this isn’t the best plan. Why? Cleaning smallish datasets is easy. Cleaning at scale, not so easy.

You need to consider what data you actually need to address the question or problem you are working on. Make some assumptions about the data you require and be careful to record those assumptions so that you can test them later if needed.

Below are some questions to help you think through this process:

  • What is the extent of the data you have available? For example, through time, database tables, connected systems. Ensure you have a clear picture of everything that you can use.

  • What data is not available that you wish you had? For example, data that is not recorded or cannot be recorded. You may be able to derive or simulate this data.

  • What data don’t you need to address the problem? Excluding data is almost always easier than including data. Note down which data you excluded and why.
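
If you want to see what this looks like in code, here’s a rough sketch, assuming a hypothetical sales.csv with an order_date column, that selects only the rows and columns we believe we need and records the assumption in a comment:

```python
import pandas as pd

# Hypothetical sales extract with an order_date column
df = pd.read_csv("sales.csv", parse_dates=["order_date"])

# Assumption (written down so we can test it later): only the last two years
# of orders and four columns matter for this question
selected = df.loc[
    df["order_date"] >= "2022-01-01",
    ["order_date", "customer_id", "amount", "region"],
]
```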

It is only in small problems, like competition or toy datasets, that the data has already been selected for you. This is one reason I discourage Kaggle competitions for those starting out. The hardest part of machine learning has already been done for them.

STEP TWO: Data Preprocessing 

After you have selected the data, you need to consider how you are going to use the data. This preprocessing step is about getting the selected data into a form that you can work with.

Three common data preprocessing steps are formatting, cleaning and sampling:

  • Formatting: The data you have selected may not be in a format that is suitable for you to work with. The data may be in a relational database and you would like it in a flat file, or the data may be in a proprietary file format and you would like it in a relational database or a text file.

  • Cleaning: Cleaning data is the removal or fixing of missing data. There may be data instances that are incomplete and do not carry the data you believe you need to address the problem. These instances may need to be removed.

  • Sampling: There may be far more selected data available than you need to work with. More data can result in much longer running times for algorithms and larger computational and memory requirements. You can take a smaller representative sample of the selected data that may be much faster for exploring and prototyping solutions before considering the whole dataset.

It is very likely that the machine learning tools you use on the data will influence the preprocessing you will be required to perform.
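
Here’s a rough pandas sketch of the three preprocessing steps above. The file names and column names are hypothetical, so treat it as a pattern rather than a recipe:

```python
import pandas as pd

# Formatting: suppose the extract arrived as an Excel file and we want a flat CSV
df = pd.read_excel("extract.xlsx")          # hypothetical file
df.to_csv("extract.csv", index=False)

# Cleaning: drop rows missing fields we believe we need (hypothetical columns)
df = df.dropna(subset=["age", "income"])

# Sampling: a small random sample is often enough for exploring and prototyping
sample = df.sample(frac=0.10, random_state=42)
```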

❝

Cleaning data is the removal or fixing of missing data.

STEP THREE: Data Transformation

The final step is to transform the preprocessed data. The specific algorithm you are working with and the knowledge of the problem domain will influence this step, and you will very likely have to revisit different transformations of your preprocessed data as you work on your problem.

Three common data transformations are scaling, attribute decompositions and attribute aggregations. This step is also referred to as feature engineering.

  • Scaling: The preprocessed data may contain attributes with a mixture of scales for various quantities such as dollars, kilograms and sales volume. Many machine learning methods prefer data attributes to have the same scale, such as between 0 and 1 for the smallest and largest value of a given feature. Consider any feature scaling you may need to perform.

  • Decomposition: There may be features that represent a complex concept that may be more useful to a machine learning method when split into the constituent parts. An example is a date that may have day and time components that in turn could be split out further.

  • Aggregation: There may be features that can be aggregated into a single feature that would be more meaningful to the problem you are trying to solve. For example, there may be data instances for each time a customer logged into a system that could be aggregated into a count of the number of logins, allowing the additional instances to be discarded. Consider what type of feature aggregations you could perform.

You can spend a lot of time engineering features from your data and it can be very beneficial to the performance of an algorithm. Start small and build on the skills you learn.
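
To tie the three transformations together, here’s a small sketch using made-up login data: it decomposes a timestamp, aggregates per-login rows into one row per customer, and scales the resulting features to the 0–1 range (scikit-learn’s MinMaxScaler is one common choice):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical login events: one row per login
logins = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "login_time": pd.to_datetime([
        "2023-01-05 08:30", "2023-01-06 21:10",
        "2023-01-05 09:00", "2023-01-07 14:45", "2023-01-08 07:20",
    ]),
    "spend": [120.0, 35.5, 12.0, 860.0, 44.0],
})

# Decomposition: split the timestamp into simpler parts
logins["login_day"] = logins["login_time"].dt.dayofweek
logins["login_hour"] = logins["login_time"].dt.hour

# Aggregation: collapse the per-login rows into one row per customer
per_customer = logins.groupby("customer_id").agg(
    login_count=("login_time", "count"),
    total_spend=("spend", "sum"),
).reset_index()

# Scaling: squeeze each numeric feature into the 0-1 range
scaler = MinMaxScaler()
per_customer[["login_count", "total_spend"]] = scaler.fit_transform(
    per_customer[["login_count", "total_spend"]]
)
print(per_customer)
```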

Let’s look at a famous dataset: the Titanic dataset. It’s commonly used by those who are learning machine learning. The target variable, that thing we are trying to predict, has been added. It’s named Survived. The column holds a 0 or a 1, which makes this a binary classification problem. We are trying to group people into two categories: those who survived and those who died. We are going to build a model that will predict whether a person would have survived had that person been aboard the Titanic.

Our dataset is below. If I were starting the process of data cleansing, the first thing that would catch my attention is the PassengerID column. The goal of every machine learning model is to find patterns, and a monotonically increasing key from a database won’t help us, so it can easily be removed.

Titanic Dataset Sample

In the sample below we removed the PassengerID column. There you go, you just completed your first data cleansing step: you removed a column that won’t be needed for modeling. If you were thinking… what about the Name column? Isn’t that mostly unique? You would be correct. The next low-hanging column we can remove is the Name column. Names are nearly unique, so it’s the next to go.

PassengerID Removed
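
In code, those two removals are a one-liner. The sketch below assumes a local copy of the Titanic training data named titanic.csv; the exact column names vary by source:

```python
import pandas as pd

# Hypothetical local copy of the Titanic training data; the exact column
# names (e.g. "PassengerId" vs "PassengerID") depend on where you downloaded it
df = pd.read_csv("titanic.csv")

# Neither a database key nor a mostly-unique name helps the model find patterns
df = df.drop(columns=["PassengerId", "Name"])
```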

Removing columns is the easiest part of data cleansing. The point of data cleansing is to apply statistical techniques to your data. You’ll end up with a dataset that’s ready for modeling, looking something like this.

NOTE: You may be thinking: where is the target variable… that thing we are trying to predict? We keep the target separate from the features. It’s used as the label during training, but we don’t want it sitting in the feature matrix, where it would hand the model the answer. We want the model to look at the features and find patterns on its own.

Clean and Model Ready
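
A common way to express that separation in code looks like this, assuming a hypothetical cleaned file named titanic_clean.csv:

```python
import pandas as pd

df = pd.read_csv("titanic_clean.csv")   # hypothetical cleaned, all-numeric file

# The target lives in its own variable: the model trains on X with y as the
# label, but "Survived" never appears inside the feature matrix itself
y = df["Survived"]
X = df.drop(columns=["Survived"])
```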

😊 You now know what data cleansing is, specific to machine learning. You learned it’s most of the work of an applied machine learning engineer. You also learned the three core steps to working with data. Lastly, you started thinking about the process of data cleansing. Start with the end goal in mind before you begin.

Try this: I want the most important features (columns) for my model, with the smallest number of features possible. The larger your search space (number of columns), the harder it will be for your model to find the patterns it needs.
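
One quick, hedged way to get a first read on which columns matter is to fit a tree-based model and inspect its feature importances. The sketch below assumes the same hypothetical titanic_clean.csv with all-numeric columns:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("titanic_clean.csv")   # hypothetical cleaned, all-numeric file
y = df["Survived"]
X = df.drop(columns=["Survived"])

# Quick-and-dirty ranking: fit a forest, see which columns it leaned on most,
# then keep only the top few for the final model
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X, y)

importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```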

We could go on, but we are getting to that point in the article where most readers fall asleep or close their laptop. So, let’s conclude here. You’ve learned a lot in this short post.

Here are some takeaways.

  • Data in the real-world is dirty and incomplete

  • Prior to data cleansing you need to source that data

  • Most data currently resides in structured data stores

  • The majority of machine learning is data sourcing and cleansing

  • Data cleansing for machine learning is quite involved

  • Data cleansing involves removing and fixing all data errors

  • Machine learning models do poorly on poorly cleansed data

  • All the data for your models will need to be numerical

  • There are three high-level steps to data cleansing

Thanks for reading. 👏