Gradient Boosters

A Gentle Introduction to Boosting


Hello everyone and welcome to my newsletter where I discuss real-world skills needed for the top data jobs. 👏

This week I’m sharing a high-level look at one of the world’s top models, XGBoost, a gradient booster.


🏆 It has won more machine learning competitions than any other model. It’s the top model for classification and regression problems on structured data, where it routinely beats even deep learning models.

In this article, let’s dive into the mechanics. No worries: no math and lots of pictures. 😂

XGBoost was primarily developed by Tianqi Chen, a researcher in machine learning. He began developing it as part of his work while pursuing a Ph.D. at the University of Washington, under the supervision of Carlos Guestrin.

Tianqi Chen released XGBoost as an open-source project, and it quickly became popular due to its speed and performance, especially in machine learning competitions like those on Kaggle.

There are two broad categories of machine learning models: traditional models and deep learning models. XGBoost is a traditional model. It is also an ensemble model. In machine learning, an ensemble model is a technique that combines the predictions from multiple individual models to produce a single, more robust, and often more accurate prediction. Specifically, XGBoost is a gradient booster.

Let’s unpack that last paragraph. 😅

  • XGBoost is a traditional model. It’s not a deep learning model.

  • XGBoost belongs to a class of models called gradient boosters.

  • Gradient boosters are models that combine with other models to produce a better result.

  • Models that combine with other models to produce a better result are called ensemble models.

Gradient boosters are used by every cloud vendor for their structured data modeling. While the exact model (XGBoost) may not be used, the underlying model is a gradient booster. For example, Amazon has SageMaker, Google has Vertex AI, and Microsoft has Azure ML.

Several years ago, Google’s AutoML Tables (now part of Vertex AI) was using gradient boosters to model regression and classification problems and then tweaking the models with hyperparameters determined by deep learning models. It was backtested on every structured data competition involving classification and regression, and it placed in the top three every time.

In machine learning vernacular, a feature is the same thing as a column. An attribute is also the same thing as a column. The rows are called observations.

Interestingly, Google noted that the only reason it lost was feature engineering. Feature engineering is the process of creating, transforming, or selecting data features to help a machine learning model make better predictions.

For example, a date-of-birth column is transformed into something model-ready, like age. Do you know anyone who spends a lot of their time talking about the importance of data cleansing? 🙂 Yes, feature engineering is part of data cleansing, and data cleansing is where the entire project succeeds or fails.
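To make that concrete, here’s a minimal sketch in Python (using pandas) of the date-of-birth-to-age transform. The column names and values are made up for illustration; your real dataset and schema will differ.

```python
import pandas as pd

# Made-up data to illustrate the transform.
df = pd.DataFrame({
    "date_of_birth": pd.to_datetime(["1985-04-12", "1992-11-03", "1978-06-25"]),
    "survived": [1, 0, 1],
})

# Engineer a model-ready numeric feature from the raw date column.
today = pd.Timestamp.today()
df["age"] = (today - df["date_of_birth"]).dt.days // 365

# The raw date is rarely useful to a tree model; the derived age usually is.
X = df[["age"]]
y = df["survived"]
```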

ℹ️ The top two problems in the real-world are classification and regression problems.

Gradient boosting is a powerful ensemble machine learning technique used for both regression and classification tasks. It builds a strong predictive model by combining the predictions of multiple simpler, weaker models (typically decision trees) in a sequential manner. A weak model is a weak learner.

Let’s unpack that last paragraph.

  • Gradient boosters excel on regression and classification tasks.

  • A weak learner is a model used as the base model for the ensemble.

  • Most of the time, the weak learner (model) is a decision tree.

What’s a decision tree? A decision tree is a visual tool that maps out possible outcomes of a series of related choices, helping to weigh actions against their costs, probabilities, and benefits. It's essentially a flowchart that branches out based on different decisions, similar to a tree with branches.


The picture below shows a decision tree from the famous Titanic dataset. The tree was created by weighing all the attributes in the dataset and determining the one with the most impact on the model. In this case, the most important attribute (column) is gender. This tree is very simple: what percentage of men versus women died on the Titanic? That one decision is our weak learner. It’s one model, a decision tree, with one answer. It’s one iteration of the entire ensemble model, one part of the sequence.
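If you’d like to see that single-decision weak learner in code, here’s a minimal sketch using scikit-learn. The tiny table is a made-up stand-in for the Titanic data, just to show a one-split tree (a “stump”).

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the Titanic data: 1 = survived, 0 = did not.
data = pd.DataFrame({
    "is_female": [1, 1, 1, 0, 0, 0, 0, 1],
    "survived":  [1, 1, 0, 0, 0, 1, 0, 1],
})

# max_depth=1 gives a single split -- a stump, the classic weak learner.
stump = DecisionTreeClassifier(max_depth=1)
stump.fit(data[["is_female"]], data["survived"])

# Prediction for a woman vs. a man.
print(stump.predict(pd.DataFrame({"is_female": [1, 0]})))
```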

With gradient boosters, the model is fit on consecutive trees, and at every step the goal is to solve for the net error from the prior tree. The word fit is a synonym for trained. Boosting is all about teamwork. Each decision tree, the weak learner, dictates what the next model will focus on. When an input is misclassified, it receives more weight so that the next tree is more likely to classify it correctly.


The picture below helps illustrate the process.

  • In the first iteration of the model, a decision tree is executed and one piece of information is learned. We are at the top of the model with one tree.

  • There’s a ton of error because only one tree was executed; you can’t learn much after one iteration or run.

  • After the second iteration or run, more information is learned and the error is reduced.

  • This process continues until all the decision trees added together find the best possible solution to the problem.

  • Each of these trees is called a weak learner. That’s just the vernacular used.
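Here’s a small sketch of that error-reduction process using scikit-learn’s GradientBoostingRegressor on synthetic data. It isn’t XGBoost itself, but staged_predict lets us watch the error shrink as weak learners are added, which is exactly the process described above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic regression data stands in for any structured dataset.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

model = GradientBoostingRegressor(n_estimators=100, max_depth=2, random_state=0)
model.fit(X, y)

# staged_predict yields the ensemble's prediction after each added tree,
# so we can watch the error drop as trees accumulate.
for i, y_pred in enumerate(model.staged_predict(X), start=1):
    if i in (1, 10, 50, 100):
        print(f"trees={i:3d}  MSE={mean_squared_error(y, y_pred):.1f}")
```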

Gradient Booster in Action

Gradient boosting involves three main steps. 

The first step is that a loss function must be optimized, and the loss function must be differentiable. A loss function measures how well a machine learning model fits the data of a certain phenomenon. Different loss functions may be used depending on the type of problem.


Different loss functions can be used for speech or image recognition, predicting the price of real estate, or describing user behavior on a website. The loss function depends on the type of problem. For example, regression may use squared error and classification may use logarithmic loss. I know… borderline a little too much math. 😄
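As a quick illustration, here’s how those two losses can be computed with scikit-learn’s metrics; the numbers below are made up. In XGBoost, the same choice shows up as the objective parameter.

```python
from sklearn.metrics import mean_squared_error, log_loss

# Regression: squared error penalizes large misses on a continuous target.
y_true_reg = [250_000, 310_000, 190_000]   # e.g. house prices (made-up values)
y_pred_reg = [240_000, 330_000, 200_000]
print("squared error:", mean_squared_error(y_true_reg, y_pred_reg))

# Classification: logarithmic loss penalizes confident wrong probabilities.
y_true_clf = [1, 0, 1]
y_pred_prob = [0.9, 0.2, 0.6]
print("log loss:", log_loss(y_true_clf, y_pred_prob))

# In XGBoost these correspond to objectives such as
# "reg:squarederror" (regression) and "binary:logistic" (classification).
```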

The second step is the use of a weak learner. In gradient boosters, the weak learner is a decision tree. Specifically, regression trees are used that output real values for splits and whose outputs can be added together, allowing subsequent models’ outputs to correct the residuals in the predictions of the previous iteration.


Classification problems and regression problems use different algorithms; however, they both use the same approach for splitting the data into groups. That approach is regression decision trees. Even classification problems use regression decision trees. In regression decision trees, the final answer is a real number, which makes it relatively simple to split the data based on the remaining error at each step.

Steps are taken to ensure the weak learner remains weak yet is still constructed in a greedy fashion. It is common to constrain the weak learners in sundry ways. Often, weak learners are constrained using a maximum number of layers, nodes, splits, or leaf nodes.
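In code, those constraints are just hyperparameters on the tree. Here’s a sketch using scikit-learn’s GradientBoostingClassifier (XGBoost exposes similar knobs, such as max_depth and min_child_weight); the specific values are arbitrary, chosen only to show where each constraint lives.

```python
from sklearn.ensemble import GradientBoostingClassifier

# Keep each tree deliberately small ("weak") by constraining its growth.
model = GradientBoostingClassifier(
    n_estimators=200,      # how many weak learners to add
    max_depth=3,           # cap the number of layers per tree
    max_leaf_nodes=8,      # cap the number of leaf nodes per tree
    min_samples_split=20,  # require enough data before making a split
    learning_rate=0.1,     # shrink each tree's contribution
)
```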

The third step is combining many weak learners in an additive fashion. Decision trees are added one at a time. A gradient descent procedure is used to minimize the loss when adding trees. That’s the gradient part of gradient boosters. Gradient descent optimization in the machine learning world is typically used to find the parameters associated with a single model that optimize some loss function.


In contrast, gradient boosters are meta-models consisting of multiple weak models whose output is added together to get an overall prediction. The gradient descent optimization occurs on the output of the model and not the parameters of the weak models.
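Here’s a minimal, hand-rolled sketch of that additive process for squared-error regression, where the negative gradient is simply the residual. This shows the core idea only; it is not XGBoost’s full algorithm (no regularization, no clever split finding).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)

learning_rate = 0.1
prediction = np.full_like(y, y.mean(), dtype=float)  # start from a constant model
trees = []

for step in range(50):
    residuals = y - prediction                 # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=2)  # a small, weak regression tree
    tree.fit(X, residuals)                     # each tree models the remaining error
    prediction += learning_rate * tree.predict(X)  # add its (shrunken) output
    trees.append(tree)

print("final MSE:", np.mean((y - prediction) ** 2))
```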

Let’s look at this process pictorially now that we have a high-level understanding of the components. Below we can see that gradient boosting adds sub-models incrementally to minimize a loss function. Earlier we said that gradient boosting involves three main steps. In our example below, the weak learner being used is a decision tree. Second, the trees are added sequentially. Last, the error of the model is reduced.

This article provided you with a high-level look at the components involved in gradient boosting.

Thanks for reading and have a great day. 👏

Heads up!! SQL Server Hyper Focus, my GPT for helping you learn foundational concepts in SQL Server, is nearly finished. 🎉

This is actually a course on the top technical questions most often seen in interviews for data professionals who will be working with SQL Server. It will be one of the first interactive courses created using a GPT.

No more watching boring courses and trying to stay awake. You’ll be able to use all the powerful features of ChatGPT to assist you with learning SQL Server. This is the next level of interactive learning. 

This is a fully customized learning experience. Just tell SSHF (SQL Server Hyper Focus) what you need. Flash cards. Done. Quizzes. Done. More detail that isn’t provided by the course. Done.

I’ve trained the model to use spaced repetition in concert with all the tools it has to help you learn. If you aren’t familiar with spaced repetition, just ask ChatGPT:

“what is spaced repetition”

Actually, let’s just ask ChatGPT how it can help you learn with SSHF. 

Game changer is used often in today’s world. However, it’s pretty easy to see that LLMs will change a lot of things, and education is one of them.

Stay tuned.