XGBoost…why we talk about it so much
Because it’s really GOOD.
Indeed, XGBoost, a gradient boosting algorithm, has consistently appeared in winning solutions of Kaggle competitions involving structured data. XGBoost delivers excellent predictive accuracy and adapts well to all types of data and problems, making it the go-to algorithm when performance and speed take precedence.
Before we jump into the code of this tutorial, we’ll take a closer look at the algorithm itself, and its underlying principles.
I have included code in this article where it’s most instructive. Full code and data can be found on my GitHub page. Let’s get started.
What is XGBoost?
XGBoost (eXtreme Gradient Boosting) is an optimized and parallelized open source implementation of gradient boosting, created by Tianqi Chen, a PhD student at the University of Washington.
XGBoost uses decision trees (like random forest) to solve classification (binary & multi-class), ranking, and regression problems. So we’re in the area of supervised learning algorithms here.
XGBoost is part of the family of ensemble methods. The difference from traditional approaches is that instead of training one single best model on the data, we train many models on various subsets of the training dataset and then aggregate their predictions into a single result.
In fact, before making a “difficult” decision, we commonly tend to ask for several opinions or ideas from different people around us—why not use this same method for an algorithm’s learning process? Collect several points of view on the problem, several ways of approaching it, and therefore have more information to make the final decision.
XGBoost incorporates a few subtleties that make it truly stand out. Chief among them is the boosting process. The principal aim of boosting is to improve the prediction quality of a weak learner by giving more and more weight to the values that are difficult to predict during training. Thus, we force the model to improve.
However, as the saying goes, “With great power comes great responsibility”. XGBoost has a few more parameters to tune than the average model, but we’ll see that there’s nothing too complicated once things are clearly defined.
In addition, we’ll see that the Python package provides significant assistance in terms of implementation. I’ll also try to provide a generic and reusable configuration for a variety of issues.
To be specific, XGBoost offers many advantages:
- Parallelization: By default, the algorithm uses all the cores of a machine’s microprocessor, which saves serious time.
- Regularization: XGBoost includes regularization, which helps it avoid overfitting. We’re therefore more likely to obtain a model that performs well on the training sample, the test sample, and new data alike; in other words, a model that generalizes.
- Non-linearity: Being based on decision trees, XGBoost captures all kinds of relationships in the data, including non-linear ones.
- Cross validation: Integrated into the algorithm, so there is no need to program it separately (see the short sketch after this list).
- Missing data: Missing values are handled natively by the algorithm, which can even capture and exploit their structure when they are not missing purely at random.
- Flexibility: Can be used for regression, classification, and ranking use cases. In addition, it’s possible to define a custom objective function to optimize when training the model. This is very useful in classification when you want to give more weight to false negatives than to false positives, or vice versa, for example when detecting fraud or screening for serious illnesses.
- Availability & scalability: XGBoost can be used on most platforms (Windows, Linux, macOS) and from many languages (C++, Java, Python, Julia). It can also run in a distributed way on Apache Hadoop, Spark, and Flink, which opens the door to processing truly massive reams of data (come on, let’s say a bad word: “big data”).
- Save & Reload: Models and data matrices can be trained, saved, and reloaded from Python (also shown in the sketch below).
- Tree pruning: Consists of removing branches (terminal parts) from decision trees that are of little use for prediction. Thus, we simplify the final model, and we gain in predictive performance (generalization).
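As a quick illustration of the built-in cross-validation and the save & reload points above, here is a short sketch using XGBoost's native API. The toy data and parameter values are placeholders, not recommendations:

import numpy as np
import xgboost as xgb

# toy data just to keep the snippet self-contained
X = np.random.rand(200, 5)
y = np.random.rand(200) * 100
dtrain = xgb.DMatrix(X, label=y)

params = {"objective": "reg:squarederror", "max_depth": 4, "eta": 0.1}

# built-in k-fold cross-validation: no extra plumbing required
cv_results = xgb.cv(params, dtrain, num_boost_round=50, nfold=5, metrics="rmse", seed=123)
print(cv_results["test-rmse-mean"].tail(1))

# save a trained booster and reload it later
booster = xgb.train(params, dtrain, num_boost_round=50)
booster.save_model("model.json")
reloaded = xgb.Booster()
reloaded.load_model("model.json")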
The principle of boosting
A gradient boosting algorithm is a special case of boosting, but how does boosting itself actually work?
The basic idea is similar to that of bagging. Rather than using a single model, we use multiple models that we then aggregate to obtain a single result. Where boosting differs is that the models are built sequentially: it begins by building a first model, which is evaluated, and from that measurement each observation is weighted according to how well it was predicted.
The objective is to give greater weight, when building the next model, to the observations whose values were badly predicted. Correcting the weights as we go makes the difficult values progressively easier to predict. A toy sketch of this reweighting idea follows below.
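Here is that toy sketch, using scikit-learn trees and sample weights. It only illustrates the mechanism of up-weighting badly predicted observations; real boosting algorithms (AdaBoost, gradient boosting, XGBoost) use carefully derived weight and model updates rather than this ad hoc rule:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosting_reweighting_sketch(X, y, n_rounds=5):
    weights = np.full(len(y), 1.0 / len(y))               # start with uniform weights
    models = []
    for _ in range(n_rounds):
        model = DecisionTreeRegressor(max_depth=2)
        model.fit(X, y, sample_weight=weights)             # each model sees the re-weighted data
        errors = np.abs(y - model.predict(X))
        weights *= 1.0 + errors / (errors.max() + 1e-12)   # up-weight badly predicted points
        weights /= weights.sum()                           # renormalize
        models.append(model)
    return models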
Gradient Boosting
This algorithm uses the gradient of the loss function to calculate the weights of individuals during the construction of each new model. It looks a bit like gradient descent for neural networks.
Gradient boosting generally uses classification and regression trees, and we can customize the algorithm using different parameters and functions.
The algorithm is inspired by the gradient descent algorithm. We consider a real function f(x) and we calculate the gradient to construct a sequence:
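In its standard form, writing εᵢ for a step size (a symbol assumed here), that sequence steps against the gradient:

$$x_{i+1} = x_i - \epsilon_i \,\nabla f(x_i)$$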
The sequence (xᵢ) converges to the minimum of the function f. We apply this to an error function from a regression problem.
Most often, we apply this method to a function F, which depends on a parameter θ :
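A typical choice, assuming a squared-error fit of the points (Xᵢ, yᵢ), is:

$$F(\theta) = \sum_{i=1}^{n} \big(f(\theta, X_i) - y_i\big)^2$$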
It is the following sequence that converges towards the minimum of the function F, so that the function f(θ, x) approximates the points (Xᵢ, yᵢ) as well as possible.
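With the same step-size notation as before, that sequence can be written as:

$$\theta_{k+1} = \theta_k - \epsilon_k \,\nabla_\theta F(\theta_k)$$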
But we could solve this problem in a space of functions and not a space of parameters:
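In a space of functions, and for a generic loss L (squared error being the usual example), the problem becomes choosing G to minimize:

$$\min_{G} \; \sum_{i=1}^{n} L\big(y_i, G(X_i)\big)$$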
The gradient is easy to calculate since it doesn’t depend on G. We could therefore construct the regression function G as an additive sequence of functions (Fk):
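In its usual additive form (leaving aside the shrinkage applied to each term, which corresponds to the learning_rate seen later), that sequence is:

$$G_K(x) = \sum_{k=1}^{K} F_k(x), \qquad G_k = G_{k-1} + F_k$$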
And we could construct the function (Fk) as a solution to a regression problem defined by the pairs (Xᵢ, zᵢ), with:
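Here zᵢ is the negative gradient of the loss at the current model, the so-called pseudo-residual; for squared error it is simply the ordinary residual yᵢ − G_{k−1}(Xᵢ):

$$z_i = -\,\frac{\partial L}{\partial G}\big(y_i, G_{k-1}(X_i)\big)$$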
That’s how gradient boosting is defined mathematically. For more details, you can look at Krishna Kumar Mahto’s excellent article, where he explains the mathematics behind gradient boosting.
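To make this concrete, here is a minimal from-scratch sketch of that additive construction with squared loss, where shallow scikit-learn trees stand in for the weak learners. It illustrates the idea only; it is not XGBoost's actual implementation, which adds regularization, second-order information, and many optimizations:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, n_stages=100, learning_rate=0.1, max_depth=3):
    base = y.mean()                                    # G_0: start from the mean
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_stages):
        residuals = y - prediction                     # z_i: pseudo-residuals for squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                         # F_k: regression on the pairs (X_i, z_i)
        prediction += learning_rate * tree.predict(X)  # G_k = G_{k-1} + shrinkage * F_k
        trees.append(tree)
    return base, trees

def predict_gradient_boosting(base, trees, X, learning_rate=0.1):
    prediction = np.full(X.shape[0], base)
    for tree in trees:
        prediction += learning_rate * tree.predict(X)
    return prediction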
The dataset
Nothing fancy about the dataset: I chose a fictional dataset of various factors that determine and predict exam scores for students at a public school. The dataset contains scores from three exams along with personal, social, and financial features that interact to influence those scores.
The dataset lists five input variables, or features. I chose the large sample, with exactly 1,000 records:
- Gender
- Race/ethnicity
- Parental level of education
- Lunch
- Test preparation course
Here’s the distribution of math, reading and writing scores based on the different variables:
Training the model and converting it to an .mlmodel
We’ll train three different models, and each of them will predict a different score (math, reading, and writing).
You’ll need to create an environment and install the packages used in this tutorial: xgboost, scikit-learn, pandas, numpy, and coremltools (plus matplotlib if you want to reproduce the plots). All of them are installable with pip.
We need to read the .csv file with all the data and select the right column for each and every score (the example below is for math scores):
import xgboost as xgb
from sklearn.metrics import mean_squared_error
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
# load the encoded dataset
stpData = pd.read_csv('StudentsPerformance-encoded.csv')
# features = every column except the three scores; target = the math score column
X, y = stpData.iloc[:, :-3], stpData.iloc[:, -3]
# DMatrix: XGBoost's optimized internal data structure
data_dmatrix = xgb.DMatrix(data=X, label=y)
Now that we have our dataframe in Pandas, and we’ve split the feature columns (X) from the target (y), we create a DMatrix, which is an internal data structure that XGBoost uses for memory and speed optimization.
The next step will be to split the training data and testing data (80/20), and finally train the model:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', learning_rate=0.1,
                          max_depth=7, alpha=10, n_estimators=1000)
xg_reg.fit(X_train,y_train)
Here’s an explanation for the XGBRegressor parameters:
- learning_rate : Shrinks the contribution (weight) of each new tree added to the model.
- max_depth: The maximum depth of a tree. By increasing this value, we’ll make the model more complex and more likely to overfit, so be careful to not abuse this variable.
- alpha : L1 regularization term on the leaf weights.
- n_estimators : The number of trees.
- num_boost_round : The equivalent of n_estimators in XGBoost’s native xgb.train API; with the scikit-learn style XGBRegressor used here, n_estimators is the one that counts, so it isn’t passed in the call above.
Now we need to test and see how the model is doing:
preds = xg_reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, preds))  # root mean square error on the held-out 20%
print("RMSE: %f" % (rmse))
We get a root mean square error (RMSE) of 14.327185, which is fairly acceptable given that the scores range from 0 to 100.
The library can also report which features matter most, for example by counting the number of times each feature is used to split across the trees:
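For reference, the chart can be produced with XGBoost’s built-in plotting helper on the xg_reg model trained above (assuming matplotlib is installed; importance_type="weight" counts split occurrences):

import matplotlib.pyplot as plt
import xgboost as xgb

xgb.plot_importance(xg_reg, importance_type="weight")
plt.show()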
We can also plot some of the trees the model creates—this will give us an idea of the reasoning behind the algorithm:
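Something along these lines draws an individual tree (num_trees picks which one; this needs the graphviz package in addition to matplotlib):

import matplotlib.pyplot as plt
import xgboost as xgb

xgb.plot_tree(xg_reg, num_trees=0)   # plot the first tree of the ensemble
plt.show()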
Convert to .mlmodel file
This is probably the simplest step in all of this. coremltools has converters for most common tree ensemble libraries, including decision trees and XGBoost:
from coremltools import converters

coreml_model = converters.xgboost.convert(xg_reg)
coreml_model.save('student_performance_math.mlmodel')
And voila! Do the same for the other two scores, and you’ll have all the necessary models to predict the three scores.
Build an iOS Application
Now we have our project ready to go. I don’t like using storyboards myself, so the app in this tutorial is built programmatically, which means no Interface Builder, just pure code 🤗.
To follow this method, you’ll have to delete the main.storyboard and set your SceneDelegate.swift file (Xcode 11 only) like so:
var window: UIWindow?
func scene(_ scene: UIScene, willConnectTo session: UISceneSession, options connectionOptions: UIScene.ConnectionOptions) {
guard let windowScene = (scene as? UIWindowScene) else { return }
window = UIWindow(frame: windowScene.coordinateSpace.bounds)
window?.windowScene = windowScene
window?.rootViewController = ViewController()
window?.makeKeyAndVisible()
}
With Xcode 11, you’ll have to change the Info.plist file like so:
You need to delete the “Storyboard Name” entry in the file, and that’s about it.
Setting the layout
Set up ViewController():
The ViewController will be used to select the input data and run inference on the models. We’ll have five pickers and a button to trigger the prediction and open the PredictionViewController():
- Gender picker view: Picking between two genders.
- Race picker view: Picking between five racial groups.
- Parent education picker view: Picking between six different education levels.
- Lunch picker view: Two different choices.
- Test preparation picker view: Picking whether or not the student has taken the test preparation course.
- Predict button: Will open PredictionViewController(), which predicts the math, writing, and reading results.
Here’s an example of a picker view. I should mention that I’ve added a UITextField() that will hold the picker view:
let genderTextField: UITextField = {
let text = UITextField()
text.text = "Gender"
text.translatesAutoresizingMaskIntoConstraints = false
text.textColor = UIColor.green
text.font = UIFont(name: "Avenir-Medium", size: 20)
return text
}()
let gender: UIPickerView = {
let picker = UIPickerView()
picker.translatesAutoresizingMaskIntoConstraints = false
return picker
}()
As for the button, I’ve created a custom button (included in the GitHub repository) that has a selector that instantiates the PredictionViewController() :
lazy var confirmBtn: BtnVideLarge = {
let button = BtnVideLarge()
button.translatesAutoresizingMaskIntoConstraints = false
button.addTarget(self, action: #selector(buttonToPredict(_:)), for: .touchUpInside)
button.setTitle("Predict", for: .normal)
button.setTitleColor(UIColor.green, for: .normal)
button.backgroundColor = .systemBackground
button.layer.borderColor = UIColor.green.cgColor
button.layer.shadowOpacity = 0.3
button.layer.shadowColor = UIColor.green.cgColor
button.layer.shadowOffset = CGSize(width: 1, height: 5)
button.layer.cornerRadius = 30
button.layer.shadowRadius = 8
button.clipsToBounds = false
button.contentHorizontalAlignment = .center
button.layoutIfNeeded()
button.contentEdgeInsets = UIEdgeInsets(top: 0, left: 0, bottom: 0, right: 0)
button.titleEdgeInsets.left = 0
return button
}()
@objc func buttonToPredict(_ sender: BtnVideLarge) {
let controller = PredictionViewController()
controller.myInputData = self.myInputData
let navController = UINavigationController(rootViewController: controller)
controller.modalPresentationStyle = .custom
self.present(navController, animated: true, completion: nil)
}
Predict and show the result
When you add the .mlmodel files, Xcode parses each model and creates a class with the necessary methods to handle prediction with the input data and return the target values:
func makePrediction() {
let writingModel = student_performance_writing()
let readingModel = student_performance_reading()
let mathModel = student_performance_math()
let writingPrediction = try! writingModel.prediction(gender: myInputData.gender, race_ethnicity: myInputData.race, parental_level_of_education: myInputData.parentEducation, lunch: myInputData.lunch, test_preparation_course: myInputData.test)
let readingPrediction = try! readingModel.prediction(gender: myInputData.gender, race_ethnicity: myInputData.race, parental_level_of_education: myInputData.parentEducation, lunch: myInputData.lunch, test_preparation_course: myInputData.test)
let mathPrediction = try! mathModel.prediction(gender: myInputData.gender, race_ethnicity: myInputData.race, parental_level_of_education: myInputData.parentEducation, lunch: myInputData.lunch, test_preparation_course: myInputData.test)
self.predictions.write = writingPrediction.target
self.predictions.read = readingPrediction.target
self.predictions.math = mathPrediction.target
}
Conclusion
XGBoost is known to be effective and fast on a wide range of tasks, but there’s more to the algorithm if you want to go further. For instance, you could build a testing pipeline and search for the hyperparameters (number of trees, depth, learning rate) that maximize the model’s accuracy.
Having the ability to use such powerful algorithms in an on-device experience is quite frankly extraordinary. Core ML gives us the power to use these models with barely any extra code.
XGBoost can also be used for classification problems; maybe that’s the next step. Try it and contact me if you have any difficulties.
Thank you for reading this article. If you have any questions, don’t hesitate to send me an email at [email protected].