
Part 13 Boosting Algorithms

Rodrigo Ledesma

Hello there! Welcome back to my blog. Last time we dove deep into the world of ensemble methods, but there was one family we missed: boosting ensembles. If this is the first time you visit my blog, the purpose of this series of articles is to create a Machine Learning model that predicts how long an average visitor will have to wait in line at a Disney or Universal park before riding their favorite rollercoaster and, with this information, to optimize their visit by scheduling each park on the days when it is less crowded. In another post I already mentioned that my favorite ML book is “Hands-On Machine Learning …” by Aurélien Géron. In the ensemble chapter, Aurélien gives an excellent definition of what a boosting algorithm is. I would like to describe it as follows:



Imagine you have 3 different models. Each data point has a weight associated with it, and this weight determines how much importance the model gives to that instance. We start by training the first model, which does a decent job (D1): as you can see, it correctly classified 2 of the blue crosses and misclassified 3 of them. The misclassified blue crosses get a higher weight, giving them more importance. The next model receives these weights and pays more attention to the instances with large weights. As you see in D2, it correctly classified all the blue crosses but misclassified several red lines. Every instance a model gets wrong has its weight increased, making the next model pay more attention to it.

In the end, the last model will have “learned” from the errors of the past models, and the ensemble combines all of them in a weighted vote, so its performance is better than that of any single predecessor. One perk of boosting algorithms is that we get to choose the base learner: SVMs, trees, forests… For our first example, let’s use a technique called AdaBoost, which lets us pick the classification algorithm we want as the base learner.
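To make the reweighting idea concrete, here is a minimal from-scratch sketch of three boosting rounds. It is only an illustration under my own assumptions: the toy data from make_classification, the depth-1 trees (stumps) and the variable names are not part of the park dataset or of the code used later in this post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy binary problem with labels recoded to -1 / +1 (illustrative data only)
X, y01 = make_classification(n_samples=400, n_features=5, random_state=0)
y = np.where(y01 == 1, 1, -1)

n_rounds = 3                                # three models, as in the description above
weights = np.full(len(y), 1 / len(y))       # every instance starts with the same weight
learners, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=weights)  # the learner sees the current weights
    pred = stump.predict(X)

    miss = pred != y
    err = np.sum(weights[miss])             # weighted error rate of this learner
    err = np.clip(err, 1e-10, 1 - 1e-10)    # guard against log(0)
    alpha = 0.5 * np.log((1 - err) / err)   # how much this learner counts in the final vote

    # Raise the weight of misclassified instances, lower the rest, then renormalise
    weights = weights * np.exp(-alpha * y * pred)
    weights = weights / weights.sum()

    learners.append(stump)
    alphas.append(alpha)

# Final prediction: weighted vote of all learners
votes = sum(a * m.predict(X) for a, m in zip(alphas, learners))
ensemble_pred = np.sign(votes)
print("first stump accuracy:", np.mean(learners[0].predict(X) == y))
print("ensemble accuracy:   ", np.mean(ensemble_pred == y))
```

The instances that end up with the largest values in weights are the ones the earlier stumps kept getting wrong, which is exactly where the next learner concentrates.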

AdaBoost Classifier

Ada stands for adaptive. The algorithm starts by fitting a weak learner and checking its misclassifications. It then adds new models one by one, each focusing on the errors of the previous one, with the purpose of turning the weak classifier into a strong classifier. So I will train a single Decision Tree model and compare it with an AdaBoost ensemble classifier. Let’s get our hands dirty:



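from sklearn import metrics
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# getXandY, trainTest, overSampling, smote and hp_oHe are defined outside this post
mu = ['day', 'temperature', 'month', 'humidity', 'hour', 'pressure', 'dayOfTheWeek']
x, y = getXandY(hp_oHe)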
#split the dataFrame into test and train
X_train, X_test, y_train, y_test = trainTest(x,y)
#Oversample the train dataset with SMOTE
X_train_os, y_train_os=overSampling(X_train, y_train, y_test, smote)
#define the variables order 
X_train_os_r = X_train_os[mu]
X_test_r = X_test[mu]

# Single decision tree
dt = DecisionTreeClassifier(max_depth=600)
dt.fit(X_train_os_r,y_train_os)
y_pred=dt.predict(X_test_r)
print("accuracy for base model decision tree is: {}".format(metrics.accuracy_score(y_test, y_pred)))

# Ensemble AdaBoost with decision tree as base learner
adaB_class = AdaBoostClassifier(DecisionTreeClassifier(max_depth=600), n_estimators=700, learning_rate=0.9)
adaB_class.fit(X_train_os_r,y_train_os)
y_pred=adaB_class.predict(X_test_r)
ac=metrics.accuracy_score(y_test, y_pred)
print("accuracy for adaBoost with base as decision tree is: {}".format(metrics.accuracy_score(y_test, y_pred)))

Our code is extremely simple. We split the data, then oversample only the training set and rearrange the features according to our Mutual Information analysis. Then we train a single decision tree (untuned) and afterward an AdaBoost ensemble with up to 700 decision trees. These are the results:
accuracy for base model decision tree is: 0.8693118134947321
accuracy for adaBoost with base as decision tree is: 0.887020847343645
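
If you do not have the helper functions from this series at hand, here is a minimal, self-contained sketch of the same kind of comparison. Everything in it is an assumption made for illustration: the synthetic data from make_classification, the depth-1 trees and the hyperparameters, so the numbers will not match the ones above. It also uses staged_score to show how accuracy evolves as estimators are added.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the park data (illustrative only)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# A single weak learner: a decision stump
single_tree = DecisionTreeClassifier(max_depth=1, random_state=42)
single_tree.fit(X_train, y_train)
tree_acc = accuracy_score(y_test, single_tree.predict(X_test))

# AdaBoost built on the same kind of stump
ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1, random_state=42),
    n_estimators=100,
    learning_rate=0.9,
    random_state=42,
)
ada.fit(X_train, y_train)
ada_acc = accuracy_score(y_test, ada.predict(X_test))

# Test accuracy after 1, 2, 3, ... boosting rounds
staged = list(ada.staged_score(X_test, y_test))

print("single stump accuracy: {}".format(tree_acc))
print("AdaBoost accuracy:     {}".format(ada_acc))
print("accuracy after 1, 10 and all rounds:", staged[0], staged[min(9, len(staged) - 1)], staged[-1])
```

Checking len(ada.estimators_) after fitting tells you how many learners were actually trained, since boosting can stop before reaching n_estimators.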

Now for another example, let’s use a Random Forest as the base learner:
mrmr = ['month', 'day', 'year', 'hour', 'minute', 'holiday', 'dayOfTheWeek', 'temperature', 'humidity', 
        'pressure', 'heavy intensity rain', 'light rain', 'broken clouds', 'scattered clouds',
        'thunderstorm with rain',
        'few clouds', 'thunderstorm', 'shower rain', 'heavy intensity rain', 'mist', 'scattered clouds']

#define the variables order 
X_train_os_r = X_train_os[mrmr]
X_test_r = X_test[mrmr]

rf = RandomForestClassifier()
rf.fit(X_train_os_r,y_train_os)
y_pred=rf.predict(X_test_r)
print("accuracy for base model random forest is: {}".format(metrics.accuracy_score(y_test, y_pred)))
       
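# Note: this ensemble is fit on X_train_os / X_test,
# not on the mrmr-ordered X_train_os_r / X_test_r used for the forest above
adaB_class = AdaBoostClassifier(RandomForestClassifier(), n_estimators=100, learning_rate=0.9)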
adaB_class.fit(X_train_os,y_train_os)
y_pred=adaB_class.predict(X_test)
ac=metrics.accuracy_score(y_test, y_pred)
print("accuracy for adaBoost with base as random fores is: {}".format(metrics.accuracy_score(y_test, y_pred)))

This has the same structure as before, so let’s jump into our results:

accuracy for base model random forest is: 0.8725028058361392
accuracy for adaBoost with base as random fores is: 0.8749719416386083

For random forests there is no real benefit: accuracy only moves from 0.8725 to 0.8750. So let’s jump into another algorithm, Gradient Boosting classifiers.

GradientBoosting Classifiers

Gradient Boosting shares its foundation with AdaBoost: both train a sequence of weak learners, each one correcting its predecessor. But instead of adjusting the weight of each instance, Gradient Boosting fits each new learner to the residual errors of the ensemble so far, which amounts to performing gradient descent on the loss function. This approach has been reported to bring great improvements in academic papers.
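
The idea of fitting each learner to the errors of the ensemble so far can be sketched in a few lines. This is a simplified illustration under my own assumptions: I use a regression problem with squared error, because there the negative gradient is simply the residual, and the toy sine data, tree depth and learning rate are made up for the example. A Gradient Boosting classifier follows the same loop with the gradient of a classification loss.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.5
prediction = np.full_like(y, y.mean())    # start from a constant model
trees, mse_history = [], []

for _ in range(20):
    residuals = y - prediction            # negative gradient of 0.5 * (y - prediction)**2
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                # the new weak learner models what is still wrong
    prediction = prediction + learning_rate * tree.predict(X)
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

print("training MSE after 1 tree:  ", mse_history[0])
print("training MSE after 20 trees:", mse_history[-1])
```

Here learning_rate shrinks the contribution of each tree, which is why it has to be tuned together with the number of trees.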

This is a heavy algorithm and we need to tune two main parameters: the learning_rate and the number of estimators. I did a little tuning and, as usual, I also checked which encoding technique works best. Be careful: this check took 3 days on two computers. Here are the results:

One hot encoding: best accuracy = 0.88213411649535, with 13 features, with MRMR

Manual Encoding: best accuracy = 0.875073846852988, with 7 features, with mutualInformation_classification2

Ordinal Encoding: best accuracy = 0.8789163722025912, with 9 features, with MRMR
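
For reference, tuning learning_rate and the number of estimators can be sketched with a small grid search. This is not the function I used for the results above. It is a minimal example that assumes synthetic data and a deliberately tiny grid so that it runs in seconds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset (illustrative only)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Tiny, hypothetical grid over the two main parameters
param_grid = {
    "learning_rate": [0.05, 0.1, 0.5],
    "n_estimators": [20, 50],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=3,                 # 3-fold cross-validation for every combination
    scoring="accuracy",
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", search.best_score_)
```

The cost grows with the number of combinations times the number of folds, which is part of why a full check on the real dataset takes so long.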

Let me open a short parenthesis here to analyze the feature selection techniques.

Conclusions on Feature Selection

As we saw in post 4, there are different statistical analyses we can perform to extract the largest amount of information from our dataset using the smallest number of features. As a recap, we used the following methods to filter features:

  • Correlation

  • Variance Threshold

  • Mutual Information

  • MRMR

So far we have been testing different algorithms, and my conclusion is that the methods that maximize the accuracy of my model on my dataset are mostly MRMR and Mutual Information. Recall that these methods look for both linear and non-linear relations in our data, while methods like Correlation can only detect linear relations.
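
A small experiment shows the difference between the two kinds of methods. In this sketch the data is made up: the target depends on the feature through a parabola, a purely non-linear relation. I use mutual_info_regression because this toy target is continuous. For a classification target, scikit-learn offers mutual_info_classif.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Made-up data: y depends on x, but not linearly
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=5000)
y = x ** 2 + rng.normal(scale=0.05, size=5000)

# Pearson correlation only measures linear dependence
pearson = np.corrcoef(x, y)[0, 1]

# Mutual information also captures non-linear dependence
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print("Pearson correlation:", pearson)
print("Mutual information: ", mi)
```

The correlation comes out close to zero even though y is almost fully determined by x, while the mutual information is clearly positive. A correlation filter would discard this feature and a mutual information filter would keep it.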

Conclusion on Gradient Boosting Classifiers

Given the difficulty and the amount of resources consumed by the algorithm, I had to limit the number of repetitions. To reduce the effect of run-to-run variation, my function reports the average result of only 3 runs, which, based on experimentation, I considered the minimum required. Please feel free to take a look at all the results and at the complete code. I do not paste it here because it is a long function.
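
The averaging idea itself fits in a few lines. The following is not my actual function, only a hypothetical sketch of it: the name mean_accuracy, the synthetic data and the choice of varying the random seed between runs are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the park data (illustrative only)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)


def mean_accuracy(n_runs=3):
    """Train and evaluate n_runs times with different seeds and average the accuracy."""
    scores = []
    for seed in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=seed)
        model = GradientBoostingClassifier(n_estimators=50, random_state=seed)
        model.fit(X_tr, y_tr)
        scores.append(accuracy_score(y_te, model.predict(X_te)))
    return np.mean(scores), np.std(scores), scores


mean, std, scores = mean_accuracy(n_runs=3)
print("accuracy per run:", scores)
print("mean = {:.4f}, std = {:.4f}".format(mean, std))
```

Reporting the standard deviation next to the mean gives a feeling for how much of the difference between two configurations could be down to chance.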

We have 3 different notebooks, one per encoding: one hot encoding, ordinal encoding and manual encoding. Each of them contains the accuracy results for different combinations of feature selection techniques. The winner of the analysis was One Hot Encoding using 13 features selected with MRMR, with an accuracy of 88.21%.

Unfortunately, the boosting algorithms did not boost our metrics above 90%, but in the next section we will analyze and talk about stacking algorithms before we enter the wonderful world of Deep Learning and Artificial Neural Networks.

