Part 12 Trees and Ensemble methods

Rodrigo Ledesma
Jun 22, 2022
8 min read

Welcome back to a new post! If you have been reading my blog, thank you! I really hope you have enjoyed it, or at least learned something. In the last post, we used Support Vector Machines with different kernels to fit our dataset’s needs, and the best part is that we achieved an accuracy of over 85%. Not bad at all.

In this post, we will be analyzing different types of tree models together with ensemble methods for classification purposes, using our harry potter and the forbidden journey waiting times dataset.

Link to google colab for decision trees: https://colab.research.google.com/drive/1W0ZEPaw2MIjXWzAKcGiqruPkQqfU5uTX?usp=sharing

Decision Trees

This is an extremely popular and also easy to understand algorithm to create a classification model. In contrast with the other algorithms we have seen so far, decision trees allow the user to have a visual representation of how the decision process runs, and exactly how the variables are treated to classify them. That is one reason why DT are very popular for Explainable AI.

The magic behind decision trees relies on a technique called “best split”. This method analyzes each feature and calculates the entropy or the amount of expected information. And the algorithm will start splitting on the feature with the lowest entropy. Lets have a little example:

We have taken some notes regarding the physical characteristics of some aquatic animals, and we want to train a model to recognize only killer whales (which are not technically whales…). So we measure the length of the animals, how many teeth they have, whether or not they have gills, and beak. From the 10 animals in the study, 5 of them were actual killer whales, and those are represented with a “p” meaning positive, the rest are represented with an “n” of negative.

The first 5 are killer whales the last 5 are other animals.

In order to start constructing our Decision Tree, we need to look for the best feature or variable to split or create the first branch, for this we will calculate the entropy of each feature.

Entropy calculation

I want to make this very easy for you, we can think of entropy as the amount or quantity of randomness in a system. How random the information in a variable is. In this case the word random will be understood as how pure the information is. We will be looking for the feature which gives us the biggest amount of information redatding our target variable.

In this specific case we will be looking for a variable whose data can split as perfect as possible the outcomes of the target variable. For example, if we had a binary feature, this one will have a low entropy and will be useful for our split if all the True values represent a class1 and all the False values represent a class2. As this concept moves away the level of entropy increases, until the point were half of the Trues are within class1 the other half of the Trues is class1; half of False is Class2 and the other half of the Fase is Class1. This variable will be technically useless for our model in this case.

Let’s start by calculating the entropy of the first feature length, but before going into formulas, let’s count the number of occurences trues and false have within this feature.

Length has 3 different classes [3,4 and 5] so lets count how many times length=3 resulted into the animal being a killer whale:

Only two, let’s now count:

How many times l=3 was not a killer whale
How many times l=4 was a killer whale
How many times l=4 was not a killer whale
How many times l=5 was a killer whale
How many times l=5 was not a killer whale

Length (3,4,5) = {2+,0-} {1+,3-} {2+,2-}

Now based on this number we can start calculating the entropy

entropy = -p log(p) - (1-p) log(p)

Now we will exchange the p for our values:

e(l=3) = -(2/2)log(2/2)-(0/2)p(0/2)=0

This is an example of what is called a pure segment and it equals 0 because all elements belong to only 1 class. Lets see the rest of the classes of length.

e(l=4) = -(1/4)log(1/4)-(3/4)p(3/4)= 0.81
e(l=5) = -(2/4)log(2/4)-(2/4)p(2/4)= 1

To finalice we will calculate the average mean of the calculations:

Length (3,4,5) = {2+,0-} {1+,3-} {2+,2-}

entropy = (2/10)*0 + (4/10)*0.81 + (4/10)*1 = 0.72

Ok perfect this is how we calculate the entropy of our first feature and guess what? Your computer makes this calculations in less than a second ;)

Now, let’s calculate the entropy for all the rest of the variables:

Based on these results we can conclude that the best feature to make the first split is: “Gills”. This process is then repeated until no features are left and that is how the tree is created.

Now that we know how the algorithm creates the splits, let’s make some code.

Coding a Decision Tree

Now comes the interesting part, we will be creating a function to train a set of decision trees and make the analysis we always do regarding which encoding method is the best for this model and also which combination of features fits our algorithm best. So let’s take a look at our function, it is pretty similar to the ones we have used for other models, so I will not be explaining it in detail

def testModel(df,var_order,n_vars,n_loops,method):
    highest = 0
    for j in tqdm(range(1,n_vars)):
        #split our dataframe into X and Y
        x,y=getXandY(df)
        #create the lists to store metrics
        acc = []
        rec = []
        preci = []
        f1 = []
        for i in range(n_loops):
            #split the dataFrame into test and train
            X_train, X_test, y_train, y_test = trainTest(x,y)
            #Oversample the train dataset with SMOTE
            X_train_os, y_train_os=overSampling(X_train, y_train, y_test, smote)
            #define the variables order 
            X_train_os_r = X_train_os[var_order]
            X_test_r = X_test[var_order]
            df1= X_train_os_r.iloc[:, 0:j] #use only part of the variables
            
            #create and train decision trees
            tree_clf = DecisionTreeClassifier()
            tree_clf.fit(df1, y_train_os)
        
            y_pred=tree_clf.predict(X_test_r.iloc[:, 0:j])
            ac=metrics.accuracy_score(y_test, y_pred)
            acc.append(ac)
            p=metrics.precision_score(y_test, y_pred,average='macro')
            preci.append(p)
            r=metrics.recall_score(y_test, y_pred,average='macro')
            rec.append(r)
            f=metrics.f1_score(y_test, y_pred, average='macro')
            f1.append(f)
        print(df1.columns)
        print("For {} features: \n Accuracy: {} \n Precision: {} \n Recall: {} \n F1 score: {}".format(
        j,mean(acc),mean(preci),mean(rec),mean(f1)))
        
        if mean(acc)>highest:
            highest = mean(acc)
            best = "best accuracy = {}, with {} features, with {}".format(mean(acc),j,method)
        print(best)
        #print(classification_report(y_test, y_pred))
    print(best)
        
def analizeDF(df,order,n_vars,n_loops):
    for i in range(len(order)):
        print('------------------------- Analyzing method {} -------------------------'.format(method[i]))
        print('The variable order is: \n {}'.format(order[i]))
        testModel(df,order[i],n_vars,n_loops,method[i])
        print('\n \n')

This is the summary of my results, it shows the best combination of features according to each encoding method and feature selection techniques:

With his we can conclude the best combination for our decision tree algorithm is having 7 figures according to the Mutual Information Classification technique and using One Hot Encoding, we have a 87.8% accuracy.

Hyper parameter tuning of Dicision Trees

Sklearn allows us to configure some hyperparameters, such as the method used to split the features using entropy or the gini index, if we want to use the best split method explained earlier or we want to do it random, how many levels our tree can have, and how many samples a split can have as a minimum. As we are using a oversampled dataset I will be using my own function to make the equivalent method to a GridSearch Cross -Validation.

crit=['gini', 'entropy']
split=['best', 'random']
max_d=[1,2,5,10,20,50,100,250,500,1000,2000,5000,10000]
min_samp_split=[2,5,10,20,50,100]

With this function we will be able to create a model using each and every combination of hyperparameter values, be careful because depending on the quantity of loops this might take a couple of minutes

def tuneDecisionT(X,Y,mu,crit,split,max_d,min_samp_split,loops):
    arr = []
    highest=0
    for i in range(len(crit)):
        for k in range(len(split)):
            for a in range(len(max_d)):
                for b in range(len(min_samp_split)):
                    for l in range(loops):
                        X_train, X_test, y_train, y_test = trainTest(X,Y)
                        #Oversample the train dataset with SMOTE
                        X_train_os, y_train_os=overSampling(X_train, y_train, y_test, smote)
                        #define the variables order 
                        X_train_os_r = X_train_os[mu]
                        X_test_r = X_test[mu]
                        tree_clf = DecisionTreeClassifier(criterion=crit[i],
                                                         splitter=split[k],
                                                         max_depth=max_d[a],
                                                         min_samples_split=min_samp_split[b])
                        tree_clf.fit(X_train_os_r,y_train_os)
                        y_pred=tree_clf.predict(X_test_r)
                        score = accuracy_score(y_test, tree_clf.predict(X_test_r))
                        arr.append(score)
                    print("For the parameters crit:{}, split:{}, "
                          "maxDep:{}, maxSamp:{} accuracy is:".format(crit[i],
                                                                     split[k],
                                                                     max_d[a],
                                                                     min_samp_split[b]))
                    mean_acc=mean(arr)
                    print(mean_acc)
                    #check the best configuration
                    if mean_acc > highest:
                        highest = mean_acc
                        description1 = "-----------------Best values = crit:{}, split:{}, ".format(crit[i],
                                                                                 split[k])
                        description2= " maxDep:{}, min_Samp:{} with accuracy: {}---------".format(
                                                                                 max_d[a],
                                                                                 min_samp_split[b],
                                                                                 mean_acc)
                    else:
                        highest = highest
                    arr = []
        print(description1+description2)

Now this is how we will be using the function based on the best parameters described up

X,Y=getXandY(hp_oHe)
mu = ['day', 'temperature', 'month', 'humidity', 'hour', 'pressure','dayOfTheWeek']
loops=10
tuneDecisionT(X,Y,mu,crit,split,max_d,min_samp_split,loops)

And this is the best result:

Best values = crit:entropy, split:random,  maxDep:500, min_Samp:2 with accuracy: 0.87889

Not very different to the one we had before but it is good. Let’s see how we can improve the result.

Ensemble methods

The whole idea behind ensemble methods is to use a set of models, trained with different parts of our dataset (so they behave differently) and use them as a group to improve the overall performance.

I want to talk about two different ensemble methods:

Bagging algorithms: Let’s imagine that we have a model M1 with a performance X, we want to improve it but we don’t know how. The bagging technique in simple words will use several models of the same kind, train them with different parts of the data and then ask them to make predictions. Once all of them have their result, there will be an algorithm that compares all predictions and make a final decision.

For example let’s say we are using a support vector machine to predict if the tomorrow it will be raining. I will train different 5 SVM with different parts of my dataset, and after asking all my models for their predictions I get the following results:

Let’s focus on the column labeled as Monday. 4 out of the 5 SVM predicted that there will be rain, the last row called voting is an algorithm that based on the predictions will give the final verdict. There are different ways of voting, this one is called “majority voting” and it consists only on taking as a winner the one which has the biggest amount of equal predictions. We also have weighted voting, which assigns more weight to those predictions which have a higher confidence.

Let’s now analyze our first implementation of a Bagging algorithm, it is called a Random Forest and the idea behind it is very simple, we will have a set of different Decision Trees and we will do voting on their results. This model has proven to be extremely good.

Random Forests

I will be reproducing the same steps as in the decision tree analysis, I will check which feature combination and which encoding method works the best for this specific model, let me give you my results:

BTW you can check this notebook here: https://colab.research.google.com/drive/1JJAIZkany49ly3SQPLVtPzdU3Vg5vpWi?usp=sharing

As you can see there is a little improvement over Decision Trees only a 1% and the best combination is using One Hot Encoding, with 21 features based on the MRMR analysis.

Now lets tune some hyperparametes:

bootstrap1= [True, False]
max_depth1=[10, 20, 50, 80, 100, None]
max_features1= ['auto', 'sqrt']
min_samples_leaf1= [1, 2, 4, 8, 10]
min_samples_split1= [2, 5, 10]
n_estimators1= [200, 500, 1000, 1500]

Let me give you also the function I used for testing each value of the hyperparametes, but feel free to take a better look in the notebook:

def tuneRandomF(X,Y,mu,bootstrap1,max_depth1,max_features1,min_samples_leaf1,
                min_samples_split1,n_estimators1,loops):
    arr = []
    highest=0
    for a in range(len(bootstrap1)):
        for b in range(len(max_depth1)):
            for c in range(len(max_features1)):
                for d in range(len(min_samples_leaf1)):
                    for e in range(len(min_samples_split1)):
                        for f in range(len(n_estimators1)):
                            start = timer()
                            for l in range(loops):
                                start = timer()
                                X_train, X_test, y_train, y_test = trainTest(X,Y)
                                #Oversample the train dataset with SMOTE
                                X_train_os, y_train_os=overSampling(X_train, y_train, y_test, smote)
                                #define the variables order 
                                X_train_os_r = X_train_os[mu]
                                X_test_r = X_test[mu]
                                rnd_clf = RandomForestClassifier(bootstrap=bootstrap1[a],
                                                                 max_depth=max_depth1[b],
                                                                 max_features=max_features1[c],
                                                                 min_samples_leaf=min_samples_leaf1[d],
                                                                 min_samples_split=min_samples_split1[e],
                                                                 n_estimators=n_estimators1[f])
                                rnd_clf.fit(X_train_os_r,y_train_os)
                                y_pred=rnd_clf.predict(X_test_r)
                                score = accuracy_score(y_test, rnd_clf.predict(X_test_r))
                                arr.append(score)
                            print("For the parameters boots:{}, mx_depth:{}, mx_fts:{} "
                                  "min_leaf:{}, min_splt:{}, n_stim {} accuracy is:".format(bootstrap1[a],
                                                                 max_depth1[b],
                                                                 max_features1[c],
                                                                 min_samples_leaf1[d],
                                                                 min_samples_split1[e],
                                                                 n_estimators1[f]))
                            mean_acc=mean(arr)
                            print(mean_acc)
                            end1 = timer()
                            print(timedelta(seconds=end1-start))
                            #check the best configuration
                            if mean_acc > highest:
                                highest = mean_acc
                                description1 = "---------------Best values = "
                                d12="boots:{}, mx_depth:{}, mx_fts:{}, min_leaf:{}, ".format(bootstrap1[a],
                                                                                         max_depth1[b],
                                                                                         max_features1[c],
                                                                                         min_samples_leaf1[d])
                                description2= "min_splt:{}, n_stim{} with accuracy: {}---------".format(
                                                                                         min_samples_split1[e],
                                                                                         n_estimators1[f],
                                                                                         mean_acc)
                            else:
                                highest = highest
                            arr = []
        print(description1+d12+description2)

And the most important part, the results:

Best values = boots:False, mx_depth:50, mx_fts: auto, min_leaf:1, min_splt:2, n_stim200 with accuracy: 0.8826803682295011

In this case I only run once each model, no mean value because due to the amount of values in each hyperparameter, the processing time skyrocketed. Thank you for reading! Next time we will be using boosting algorithms so keep tuned!

RODRIGO LEDESMA