Top Libraries in Machine Learning

Learn What the Professionals Use Daily

Hello everyone and welcome to my newsletter where I discuss real-world skills needed for the top data jobs. 👏

This week I’m sharing the core libraries used in applied machine learning. 👀

Not a subscriber? Join the informed. Over 200K people read my content monthly.

Thank you. 🎉

Contents:

  • Defined: Machine Learning

  • Python: The king in machine learning 👑

  • NumPy: Array 😙

  • Pandas: Data 🐼

  • SciKit-Learn: Swiss Army Knife 🗡️

  • XGBoost: Traditional model king 👑

  • matplotlib: Visualization 👀

  • TensorFlow: Frameworks 🖼️

The machine learning engineer is the top role in all of AI. A machine learning engineer works the entire end-to-end machine learning pipeline. In the applied space, machine learning is a four-step process. ⚙️

You source the data for your model. You clean the data. You model the cleansed data. You put that model in production. Two of these steps, cleaning and modeling, are often done entirely in Python with the core libraries. Sourcing is often done via SQL. Putting the model in production will depend on how the model is presented to the users.

Contrary to popular belief, machine learning engineers do not code their own models. They use well-vetted, reliable, open-source ones created by the top names in the AI space. The top models come packaged as libraries and frameworks.

🏅 The top language in the world of machine learning is Python. There are a few reasons for it.

  • Simplicity - While it’s not a hard and fast rule, the lower a language’s barrier to entry, the more widely it tends to be used. Python is simple. Python might be the highest-level language out there. That means just about anyone can learn it.

  • Libraries - A library in Python is a group of pre-bundled code you can import into your environment to extend the language’s functionality. There are libraries for just about every aspect of applied machine learning.

  • Jupyter Notebook - Jupyter Notebooks are a powerful way to author your code in Python. A Jupyter Notebook is a web-based interface that allows for rapid prototyping and sharing of data-related projects.

The success of the Jupyter Notebook hinges on a form of programming called literate programming. Literate programming is a software development style created by Stanford computer scientist Donald Knuth. This type of programming emphasizes a prose-first approach where human-friendly text is punctuated with code blocks. It excels at demonstration, research, and teaching objectives, especially for science.

The simplicity, readability, libraries and integrated development environment make Python one of the most used languages in the machine learning space.

The library was briefly defined above, but let’s give it some more attention. A library in Python is a group of pre-bundled code you can import into your environment to extend the language’s functionality.

The first library you’ll often use is NumPy. NumPy, short for Numerical Python, is a cornerstone library for scientific computing in Python. It provides robust support for large, multi-dimensional arrays and matrices, along with a comprehensive collection of high-level mathematical functions that operate efficiently on these data structures. All machine learning models are built on numerical data in the shape of arrays. The array, or matrix, is the foundational object in machine learning. The tensor in TensorFlow is an array.
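To make that concrete, here’s a minimal sketch of a NumPy array (the values are just illustrative):

```python
import numpy as np

# A small 2-D array (matrix): the kind of object models are built on.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

print(X.shape)  # (2, 2): two rows, two columns
```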

In the code segment below, we use the import statement to bring two libraries into our session: the pandas and NumPy libraries. The word import means to bring in. The words as pd and as np are simply aliases. They allow us to type less code.
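A minimal sketch of that segment:

```python
# "import" brings a library into the session; "as" assigns a shorter alias.
import pandas as pd
import numpy as np
```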

The pandas library is built on top of NumPy arrays. Pandas is a powerful and widely-used open-source Python library specifically designed for data manipulation, analysis, and cleaning. Built upon the foundation of the NumPy library, pandas provides intuitive data structures and a rich set of functions to make working with structured data, such as tabular data (like spreadsheets or SQL tables) and time series, both easy and efficient.

The DataFrame in Pandas is a widely used data structure and a standard way to store data. A DataFrame is data aligned in rows and columns, like a SQL table or a spreadsheet. We can either hard-code data into a DataFrame or import a CSV file, TSV file, Excel file, SQL table, etc. We can use the constructor below to create a DataFrame object.
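Here’s a minimal sketch of that constructor, with a few hard-coded (illustrative) cities:

```python
import pandas as pd

# Build a DataFrame from a dictionary: keys become columns, lists become rows.
df = pd.DataFrame({
    "city": ["Austin", "Denver", "Boston"],
    "state": ["TX", "CO", "MA"],
})

print(df)
```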

We can also create a DataFrame by importing a CSV file. The values within each record are separated by the comma character. Pandas provides a useful function named read_csv() to read the contents of a CSV file into a DataFrame. For example, we can create a file named cities.csv containing details of American cities. Assuming the CSV file is stored in the same directory as the Python script, it can be imported using:
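A minimal sketch, assuming cities.csv sits next to the script:

```python
import pandas as pd

# Read the CSV file into a DataFrame.
df = pd.read_csv("cities.csv")

print(df.head())  # peek at the first five rows
```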

📚PROMPT HOMEWORK

  • what is a csv file

  • what is a library in Python

  • what is the difference between a csv file and an excel file

  • why are csv files used so often in machine learning

  • what are the main data structures in pandas

The next library, and another one you will use every day, is SciKit-Learn. Scikit-learn (often referred to as sklearn) is a widely used, free, and open-source machine learning library for the Python programming language. It is a fundamental tool for data scientists and machine learning practitioners, providing a comprehensive suite of algorithms and utilities for tackling various machine learning tasks.

What’s really important to understand is that SciKit-Learn is not widely adopted because of its models. It’s ubiquitous because of its tools for model evaluation, model selection, and preprocessing. It has everything you’ll need for data cleansing, the single most important and most time-consuming task of a machine learning engineer. Let’s discuss the code block below.
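Here’s a minimal sketch of that block, matching the points below:

```python
from sklearn.model_selection import train_test_split
from sklearn import datasets, metrics

# Load the bundled iris dataset into our environment.
iris = datasets.load_iris()
```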

  • The from keyword imports only the specific part of the library needed, not the entire library. This is considered very pythonic, but has no deeper significance here.

  • sklearn is the name you use to import SciKit-Learn.

  • SciKit-Learn has datasets you can import for free to learn data cleansing and modeling.

  • Notice that train_test_split is being imported from model_selection within SciKit-Learn.

  • We are importing metrics.

  • In the final line of code, a variable named iris is created and the iris dataset is loaded into our environment.

📚 PROMPT HOMEWORK

  • what does train_test_split do from sklearn

  • what are metrics from sklearn

  • what does a single equals sign mean in Python

  • what does a double equals sign mean in Python

  • what is kfold

I’d pay attention to those prompts above. Every one was an interview question at a top tech company. These are softball questions. That means if you get one wrong, the interview is over and the company moves on to the next candidate. These are the simplest interview questions you are going to receive.

The next library is XGBoost. It’s the gold standard in the real world for building models on structured data. Most problems in the real world fall into two categories: they are either classification or regression problems.

XGBoost, which stands for Extreme Gradient Boosting, is an open-source library that provides an efficient and effective implementation of the gradient boosting algorithm. It has gained immense popularity in the machine learning community due to its high performance, speed, and ability to handle complex datasets.

In the code block below, XGBoost has been imported into our environment. Can you tell what kind of model we are building? Right, a classification model, because we are importing XGBClassifier and not XGBRegressor.
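A minimal sketch of that block:

```python
# Importing XGBClassifier (not XGBRegressor) signals a classification model.
from xgboost import XGBClassifier

model = XGBClassifier()
```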

XGBoost became so popular on competition sites that a top competition site called Kaggle stopped having structured data competitions because XGBoost was winning all of them.

I authored a book many years ago that I never published. Here’s an excerpt from it specific to gradient boosters. XGBoost is a gradient booster.

Gradient boosting is a machine learning technique for regression and classification problems, which produces a model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.
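To make the stage-wise idea concrete, here’s a toy sketch of my own (not from the book) of gradient boosting with a squared-error loss, where each new tree fits the residuals of the ensemble so far:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_stages=100, lr=0.1):
    """Toy gradient boosting for regression with squared-error loss."""
    pred = np.full(len(y), y.mean())  # stage 0: a constant prediction
    trees = []
    for _ in range(n_stages):
        residuals = y - pred  # the negative gradient of squared error
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
        pred = pred + lr * tree.predict(X)  # shrink each weak learner
        trees.append(tree)
    return y.mean(), trees
```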

If you want me to create a post about how they work at a high level, I’ll put a poll together, and if there’s enough interest, I’ll pull the highlights from my book.

Would you like an article on the basics of gradient boosters?

We need to visualize our data before we clean or model it. The Godfather of data visualization in Python for machine learning is matplotlib. Matplotlib is a widely used, comprehensive library for creating static, animated, and interactive visualizations in Python. It serves as a fundamental tool for data visualization within the Python ecosystem, particularly in scientific computing and data analysis.  

Here’s an entire code block you can run if you have access to a Jupyter notebook. We are importing the NumPy library and a module of the matplotlib library called pyplot. Acting as a scripting layer, pyplot offers a collection of functions that directly modify the current figure and plotting area.
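A minimal, runnable sketch; the sine curve is just a stand-in, since any simple plot will demonstrate pyplot:

```python
import numpy as np
import matplotlib.pyplot as plt

# 100 evenly spaced points between 0 and 2*pi, and their sine values.
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

plt.plot(x, y)        # draw the line on the current figure
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.title("A first matplotlib plot")
plt.show()            # render the figure
```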

When it’s time to model images, audio, text, or other unstructured data, the best model choice is deep learning. The top two frameworks in the real world are TensorFlow and PyTorch. Every place I’ve worked at used TensorFlow, and I believe it’s still quite a bit ahead of PyTorch for real-world deep learning modeling.

TensorFlow is a widely-adopted open-source platform for machine learning and artificial intelligence developed by the Google Brain Team. At its core, TensorFlow is designed for numerical computation, particularly well-suited for building and training deep neural networks. TensorFlow supports a variety of programming languages, with Python being the primary interface.

Several years ago, TensorFlow added Keras, a high-level API that simplifies the process of building and training neural networks, making TensorFlow more accessible to beginners.

Below is a code block of TensorFlow with Keras. Take note that we don’t need to import the Keras library separately, because a copy of it ships inside TensorFlow. That means we can call all the Keras code directly from within TensorFlow. Keras is a library that lives outside of TensorFlow and is also part of core TensorFlow. This is a little confusing.

In TensorFlow, particularly within the tf.keras API, a Sequential model is a way to build neural networks layer by layer in a linear fashion. It's the simplest type of model in Keras and is suitable for networks where the data flows directly from one layer to the next without branching or complex connections.
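Here’s a minimal sketch of a Sequential model; the layer sizes and input shape are illustrative assumptions, not a prescription:

```python
import tensorflow as tf

# Keras ships inside TensorFlow, so there's no separate Keras import.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),                      # 4 input features
    tf.keras.layers.Dense(16, activation="relu"),    # hidden layer
    tf.keras.layers.Dense(3, activation="softmax"),  # 3-class output
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.summary()  # print the layer-by-layer architecture
```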

📚PROMPT HOMEWORK

  • what is keras

  • what does sequential mean in the context of defining a deep learning model

  • what does dense mean in the context of defining a deep learning model

  • what is a layer in a neural network

We’ve covered a lot. Here’s what we learned.

  • Python - The gold standard in applied machine learning.

  • NumPy - The array. The object machine learning models understand.

  • Pandas - The data king. Used to massage tabular datasets.

  • SciKit-Learn - The Swiss Army knife. We use it for all the tools we need for data cleansing and modeling. We don’t need its models; nothing beats XGBoost.

  • XGBoost - The structured data king. The world’s top model for classification and regression problems on structured data.

  • matplotlib - The godfather of visualization. We need to see our data before we can model it.

  • TensorFlow - The current king of deep learning.

Thanks everyone and have a great day. 👏

How did we do?
