Hi everyone! Welcome back to my blog, and thank you if you have been reading my posts. If you read the last one, you know we will not be talking about cloud services in this post; Google Cloud will have to wait a little longer, as I am really excited to work with SVMs, trees, and neural networks. But if you enjoyed the last post, let me know and I can make some time in between models for more GCP topics.
If this is your first visit to my blog: the purpose of this series of articles is to create a Machine Learning model that predicts how long an average visitor will have to wait in line at a Disney or Universal park before riding their favorite rollercoaster, and with this information, optimize their visit by scheduling each park on the day it is least crowded.
So the next step, as the post's title suggests, is to use Support Vector Machines in a classification problem. But first, let's take a look at what a Support Vector Machine is and how we will use it.
Support Vector Machines
The idea behind SVMs is simple. Imagine a hyperplane in two dimensions that contains all the information of an n-dimensional dataframe. To keep things easy, this hyperplane will hold only two classes. An SVM calculates an imaginary line that separates these two groups; this line is actually a vector, and the model's job is to fit it in the best possible position so that each group contains as many correct classifications as possible. This vector is represented by the solid line in the image above. Now, this division alone is not enough: the SVM also bases its analysis on two parallel supplementary vectors. You can think of these 3 parallel vectors as a highway with two lanes. Our job as data scientists is to understand our data and tune the width of this highway so that it contains the smallest possible number of incorrect classifications (tuning the hyperparameters); the model will calculate the direction of the vector.
Of course, SVMs are not restricted to linear highways; we can train a model that constructs polynomial highways to better fit our data.
If you are lucky enough to be able to plot your data in a 2- or 3-dimensional space, you will be able to understand very well how the model fits your data. The problem arises when you are working with a 400-dimensional space.
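To make this concrete before we touch the park data, here is a minimal sketch on a toy 2-dimensional dataset (not our real one) of fitting a linear SVM and inspecting the vector it finds:
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Two easily separable clouds of points stand in for our two classes
X_toy, y_toy = make_blobs(n_samples=100, centers=2, random_state=42)
clf = LinearSVC(C=1, loss="hinge")
clf.fit(X_toy, y_toy)
print(clf.coef_)       # direction of the separating vector
print(clf.intercept_)  # its offset from the origin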
Using Support Vector Machines to predict Harry Potter waiting times
As mentioned earlier, the purpose of this blog and of this study is to create a model that helps us predict how long an average visitor will have to wait in line before riding Disney World's or Universal Studios' rollercoasters. So far we have squeezed almost all the juice out of the Logistic Regression technique, so let's give it a rest and try to improve our results using SVMs.
Thanks to Sklearn, the implementation of this model for prediction is extremely simple. Let’s see how to do it:
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

# Read the dataset
hp = pd.read_csv('HP_OHE_3class.csv')
hp = hp.drop('Unnamed: 0', axis=1)

# Divide the information, oversample and filter features
# (getXandY and trainTest are helpers defined in earlier posts of this series)
smote = SMOTE(random_state=42)
X, Y = getXandY(hp)
X_train, X_test, y_train, y_test = trainTest(X, Y)
X_train_Smote, y_train_Smote = smote.fit_resample(X_train, y_train)

# Feature order given by the mRMR selection
mrmrO = ['month', 'day', 'year', 'hour', 'minute', 'holiday', 'dayOfTheWeek', 'temperature', 'humidity',
         'pressure', 'heavy intensity rain', 'light rain', 'broken clouds', 'scattered clouds',
         'thunderstorm with rain', 'few clouds', 'thunderstorm', 'shower rain', 'heavy intensity rain',
         'mist', 'scattered clouds']
X_train_Smot_r = X_train_Smote[mrmrO]
X_test_r = X_test[mrmrO]

# Train and test our model
lsvm = LinearSVC(C=1, loss="hinge")
lsvm.fit(X_train_Smot_r, y_train_Smote)
y_pred = lsvm.predict(X_test_r)
print(classification_report(y_test, y_pred))
precision recall f1-score support
1.0 0.63 0.73 0.68 2345
2.0 0.51 0.49 0.50 1535
3.0 0.33 0.21 0.26 845
accuracy 0.56 4725
macro avg 0.49 0.48 0.48 4725
weighted avg 0.54 0.56 0.54 4725
At first glance we can see one thing: we have an accuracy of 56%, and we have not done any hyperparameter tuning, nor used more complex models such as the polynomial SVM. So we are on the right track.
Trying different feature combinations again
We have seen that an untuned SVM gives better results than the Logistic Regression algorithm, but this does not mean we can just copy-paste the process we used for that algorithm. What I mean is, there is no guarantee that this specific combination of variables, the result of the feature selection algorithm, will turn out to be the best for the SVM. So, just for fun, and because I am on vacation, let's run all the combinations of all the feature selection methods for the 3 encoding configurations one more time. This time, though, I will be running them with only 20 loops instead of 100.
If you go to the Colab notebook you will be able to see all the configurations and the pre-processing pipeline. It is the same as in Part 8, but in this case our target variable has only 3 classes. Please feel free to check out the results; I will be focusing one more time on accuracy, and I will only give you the conclusive results, not a complete analysis like last time, because that would be very repetitive and time-consuming.
And the new winner is the following configuration:
One Hot Encoding
Mutual Information Classification code 2
11 Features
Acc: 0.5707
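As a side note, here is a rough sketch of how a Mutual Information ranking like this one can be produced with sklearn; the actual selection code (including the "code 2" variant) lives in the companion notebook:
from sklearn.feature_selection import mutual_info_classif
import numpy as np

# Score each feature against the 3-class target and rank from best to worst
mi_scores = mutual_info_classif(X_train, y_train, random_state=42)
ranking = np.argsort(mi_scores)[::-1]
print([X_train.columns[i] for i in ranking[:11]])  # the 11 top-ranked features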
It is an improvement over what we had with Logistic Regression, so let's see if we can improve it even further via hyperparameter tuning.
Hyperparameter Tuning on Linear Support Vector Machines
When we were using Logistic Regression there was not much we could configure in the algorithm. But in this case, the linear SVM has a special parameter called "C". Not a very creative name, but it is indeed useful. This parameter defines (using the highway metaphor) how wide or narrow the highway will be, and it has a direct impact on the performance of the model, because depending on the width, this area of the hyperplane will contain different quantities of classified elements.
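To see what C does in practice, here is a quick sketch reusing the oversampled training split and feature order from the first code block; the exact numbers will vary from run to run:
from sklearn.metrics import accuracy_score

# A small C widens the highway (more tolerance for points inside the margin),
# a large C narrows it; both extremes can hurt generalization
for C_value in (0.01, 1, 100):
    model = LinearSVC(C=C_value, loss="hinge")
    model.fit(X_train_Smot_r, y_train_Smote)
    print(C_value, accuracy_score(y_test, model.predict(X_test_r)))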
To try out different combinations of hyperparameters automatically, sklearn has a beautiful solution called GridSearchCV (grid search with cross-validation). Let's see what it does.
GridSearch CV
The GridSearch function will run a model with all possible combinations of hyperparameters given by the user. This is useful because, in a few lines, we get automated code that runs and analyzes the metrics of the resulting models.
The inputs for this function are only two:
A model
A list of parameters and values
The function will create a set of models using each combination of hyperparameters, test each of them, and at the end output the best combination, so that we can use it afterwards without coding every combination ourselves.
Let’s see the code:
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 0.5, 1, 5, 10, 50, 100], 'loss': ['hinge', 'squared_hinge']}
grid = GridSearchCV(LinearSVC(), param_grid)
grid.fit(X_train_Smot_r, y_train_Smote)
Easy like winning a match against the Ottawa Senators! But let's see what we are doing. We imported GridSearchCV and then created a dictionary with two hyperparameters, the loss function and the C parameter, each with an array of values. The function takes those values, builds the combinations, trains the models, and tests them.
This is very easy, but we still don't know which combination is best, so let's use one more line of code:
print(grid.best_estimator_)
LinearSVC(C=1, loss='hinge')
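If you want more detail than the winning estimator, the fitted grid object also exposes the best parameter dictionary and its mean cross-validated score:
print(grid.best_params_)  # e.g. {'C': 1, 'loss': 'hinge'}
print(grid.best_score_)   # mean cross-validated accuracy of that combination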
Now we know what the function considers best. But there is a problem: we are using the oversampled data for both training and testing, which is not valid. So let's write a little code that uses oversampled data only in the training set.
from statistics import mean
from sklearn.metrics import accuracy_score

c = [0.1, 0.5, 1, 5, 10, 50, 100]
LOSS = ["squared_hinge", "hinge"]
loops = 20
arr = []

# Feature order given by the Mutual Information analysis
mu = ['day', 'temperature', 'month', 'humidity', 'hour', 'pressure', 'dayOfTheWeek', 'year',
      'holiday', 'shower rain', 'light rain']

for i in range(len(c)):
    for j in range(len(LOSS)):
        for k in range(loops):
            X_train, X_test, y_train, y_test = trainTest(X, Y)
            # Oversample the train dataset with SMOTE
            # (overSampling is a helper defined earlier in this series)
            X_train_os, y_train_os = overSampling(X_train, y_train, y_test, smote)
            # Define the variables order
            X_train_os_r = X_train_os[mu]
            X_test_r = X_test[mu]
            lsvm = LinearSVC(C=c[i], loss=LOSS[j])
            lsvm.fit(X_train_os_r, y_train_os)
            y_pred = lsvm.predict(X_test_r)
            score = accuracy_score(y_test, y_pred)
            arr.append(score)
        print("For the parameters C:{} and loss:{}, accuracy is:".format(c[i], LOSS[j]))
        print(mean(arr))
        arr = []  # reset the scores before the next combination
The code is pretty simple, but let's look at each piece. First, the mu array contains the order of the variables that turned out to be best according to the Mutual Information analysis. From there we have 3 loops: the first goes through all the values of the C hyperparameter, the second goes through all the values of the loss hyperparameter, and the third creates N models, each with reshuffled data, so that we can report a final metric using the mean result.
Inside the loops we split the dataset, oversample, limit the number of variables, train the model, and calculate the accuracy. Here are the results:
For the parameters C:0.1 and loss:squared_hinge, accuracy is:
0.5477989417989418
For the parameters C:0.1 and loss:hinge, accuracy is:
0.553962962962963
For the parameters C:0.5 and loss:squared_hinge, accuracy is:
0.5508536155202822
For the parameters C:0.5 and loss:hinge, accuracy is:
0.5534259259259259
For the parameters C:1 and loss:squared_hinge, accuracy is:
0.5517544973544973
For the parameters C:1 and loss:hinge, accuracy is:
0.5533192239858906
For the parameters C:5 and loss:squared_hinge, accuracy is:
0.5526228269085411
For the parameters C:5 and loss:hinge, accuracy is:
0.5536534391534391
For the parameters C:10 and loss:squared_hinge, accuracy is:
0.5527854203409759
For the parameters C:10 and loss:hinge, accuracy is:
0.5534338624338624
For the parameters C:50 and loss:squared_hinge, accuracy is:
0.5516084656084657
For the parameters C:50 and loss:hinge, accuracy is:
0.5514964726631393
For the parameters C:100 and loss:squared_hinge, accuracy is:
0.5485649165649166
For the parameters C:100 and loss:hinge, accuracy is:
0.5484822373393802
A couple of thoughts: the loss is better when configured with "hinge" rather than "squared_hinge", and the best value of C is 0.1. But this is not the best we can do; a Support Vector Machine does not have to be linear, so let's see what happens when we make it polynomial.
Polynomial Support Vector Machine
As we can see, the linear kernel of the Support Vector Machine offers some improvement over Logistic Regression, but it is very unlikely that our data is linearly separable, due to the number of variables and the behavior of the features. So the next step is to try fitting polynomial vectors to the model and compare the results.
For this we will configure 3 parameters, and we will actually use a different class from sklearn. Let me show you an example:
from sklearn.svm import SVC

psvm = SVC(kernel="poly", degree=2, coef0=1, C=5)
psvm.fit(X_train_Smot_r, y_train_Smote)
y_pred = psvm.predict(X_test_r)
score = accuracy_score(y_test, y_pred)
score
0.662010582010582
Wait… what!? A 10% improvement over the last SVM, with hyperparameters picked at random? Yes, it is that good! Of course we will not stop at only 66% accuracy; let's do the equivalent of a GridSearch with our oversampled data. These are the values I will be testing:
deg = [1, 2, 3, 4, 5, 6, 7]
c = [0.01, 0.1, 1, 5, 10, 50, 100]
So I will be creating different models with different highway widths and different polynomial degrees. Let me warn you: as the polynomial degree increases, so does the time needed to train and test the model. I have been running this analysis for a couple of days now using two computers, and it requires a lot of resources, so be careful when you run the Google Colab or the actual notebook.
Let's take a look at the function that will allow us to run the tests while watching a TV series:
from timeit import default_timer as timer  # assuming timeit's default_timer

def polySMV(X, Y, mu, deg, c, loops):
    arr = []
    highest = 0
    description = ""
    for i in range(len(deg)):
        for k in range(len(c)):
            for l in range(loops):
                start = timer()
                X_train, X_test, y_train, y_test = trainTest(X, Y)
                # Oversample the train dataset with SMOTE
                X_train_os, y_train_os = overSampling(X_train, y_train, y_test, smote)
                # Define the variables order
                X_train_os_r = X_train_os[mu]
                X_test_r = X_test[mu]
                psvm = SVC(kernel="poly", degree=deg[i], coef0=1, C=c[k])
                psvm.fit(X_train_os_r, y_train_os)
                y_pred = psvm.predict(X_test_r)
                score = accuracy_score(y_test, y_pred)
                arr.append(score)
                print("-")  # progress marker, one dash per trained model
                # End timer
                end1 = timer()
                # print(timedelta(seconds=end1 - start))
                # Keep track of the best configuration so far
                if score > highest:
                    highest = score
                    description = "best values = degree:{}, and c:{}".format(deg[i], c[k])
            print("For the parameters degree:{} and c:{}, accuracy is:".format(deg[i], c[k]))
            print(mean(arr))
            arr = []
    print(description)
OK, first step: our function receives our complete X and Y datasets, the order of the 11 variables given by the Mutual Information method, the degrees we want to analyze in an array, the C values in another array, and finally how many times we want to create a model before we take the average.
As before, we have 3 loops: the first analyzes each of the elements given in the degree array, the second goes through all the C values, and the third controls how many times we create and test a model with each combination. The first lines are very straightforward: we split into train and test, oversample only the train dataset, trim and rearrange the order of the variables according to the Mutual Information analysis, and finally create, train, and evaluate the model's performance.
We keep track of the best accuracy, saving the degree and C configuration in a string. Each model's score is compared against the best so far, and each time a combination finishes, the function prints its mean accuracy. At the very end, the function prints the best configuration obtained.
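With everything in place, a call might look like this (deg and c are the grids defined above, mu is the Mutual Information feature order, and 20 loops per combination mirrors the linear experiment):
polySMV(X, Y, mu, deg, c, 20)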
Polynomial Support Vector Machine Results
I made a bit of a mess with the notebooks because I wanted to use several machines in parallel, so the 1st and 2nd degree results are in notebook 2.1, the 6th and 7th are in notebook 2.1.3, and the 3rd, 4th, and 5th are in notebook 2.1.2. Here is a summary of the results:
As we can see, the 7th-degree polynomial has the best accuracy of all, but there is still one more trick SVMs can do, and that is to use an RBF kernel instead of a polynomial kernel. You can see the code and all the results in the 2.1.2 notebook; I will just present the best configuration.
For the parameters gamma:1 and C:1000, accuracy is:
0.8131640211640212
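For completeness, here is a minimal sketch of that winning RBF configuration, assuming the same oversampled split and Mutual Information feature order used in the loops above:
# The RBF kernel measures similarity between points instead of building
# explicit polynomial features, which keeps the model simpler
rbf_svm = SVC(kernel="rbf", gamma=1, C=1000)
rbf_svm.fit(X_train_os_r, y_train_os)
print(accuracy_score(y_test, rbf_svm.predict(X_test_r)))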
This trick of using a different kernel saves us some processing time, and the model is also less complex; as we can see, the accuracy obtained is quite acceptable. We have reached a point where the accuracy of our model has surpassed random guessing, and we now have a trustworthy model for our main purpose.
This is the end of this post; I hope you have enjoyed it or learned something from it. In the next posts we will step into the wonderful world of tree-like models and ensembles. Thanks for reading!