Part 9: Improving the accuracy of the model
- Rodrigo Ledesma
- Jun 16, 2022
- 3 min read

Welcome back to part 9 of this series. If you are a new reader, welcome! In my last post, we discussed the performance metrics of the logistic regression model: we analyzed the recall, precision, F1 score, and accuracy of all possible combinations of techniques and encoding methods. Unfortunately, the metrics did not turn out well; the model reached only about 30% accuracy. In this post, we will look at ways to improve that.
If this is your first visit to my blog: the purpose of this series of articles is to create a Machine Learning model that predicts how long an average visitor will wait in line at a Disney or Universal park before they can ride their favorite rollercoaster, and, with this information, to optimize their visit by scheduling each park on the day it is less crowded.
How to keep improving the accuracy by molding the target variable
If you take a glance at my Google Colab, you will notice I did a little data cleansing, but nothing we have not done before; the only important part was plotting all the histograms. Here is a taste of the most important features:

Now it is quite obvious that certain features produced by the one-hot encoding will not be as useful as others: they are binary and have almost no samples in the second class.
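A quick way to spot those near-constant columns is to look at each binary feature's class balance. Here is a minimal sketch, assuming the cleaned dataframe is called hp as in the code further down (the 1% threshold is my own choice, not something from the notebook):

# Hypothetical check: flag one-hot columns whose minority class
# holds less than 1% of the rows.
binary_cols = [c for c in hp.columns if hp[c].dropna().nunique() == 2]
for col in binary_cols:
    minority_share = hp[col].value_counts(normalize=True).min()
    if minority_share < 0.01:
        print(f"{col}: minority class is only {minority_share:.2%} of rows")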
Moving on, one problem here is that the features do not have a strong correlation with the target variable, so it really does not matter how much we combine them: the result will be similar. What we need to do is help the model by giving it less work to learn. What I mean by this is simple; let me explain.
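You can check that weak relationship directly. A minimal sketch, assuming the target column is still named Harry_Potter_and_the_Forbidden as in the code below:

# Linear correlation of every numeric feature with the wait-time target
corr = hp.corr(numeric_only=True)['Harry_Potter_and_the_Forbidden']
print(corr.drop('Harry_Potter_and_the_Forbidden').sort_values())

Pearson correlation is a crude measure for binary features, but it is enough to confirm that no single feature tracks the target closely.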
Reducing the classes in the target variable
As of now, the model is working with seven different classes to make predictions (10, 20, 30, 40, 50, 60, and 100 mins). Also, let me remind you that I trimmed the dataset by deleting the outliers. That might not have been the brightest idea, so here is what I will do next:
Now let's change the classes within the target variable: instead of deleting the outliers, let's group all values into four classes:
- Up to 30 min (group 1)
- From 30 to 60 min (group 2)
- From 60 to 120 mins (group 3)
- More than 120 mins (group 4)
Here is how we will be handling this change:
# Delete rows with 0 min
hp = hp[hp.Harry_Potter_and_the_Forbidden != 0]

# Replace values with the encoding we constructed above:
# 5-30 -> 1, 35-60 -> 2, 65-120 -> 3, 125-180 -> 4
a = hp.Harry_Potter_and_the_Forbidden.replace(
    [5, 10, 11, 15, 20, 25, 30,
     35, 40, 45, 50.0, 55.0, 60.0,
     65.0, 70.0, 75.0, 80.0, 85.0, 90.0, 95.0, 100.0, 105.0, 110.0, 115.0, 120.0,
     125.0, 130.0, 135.0, 145.0, 150.0, 180.0],
    [1, 1, 1, 1, 1, 1, 1,
     2, 2, 2, 2, 2, 2,
     3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
     4, 4, 4, 4, 4, 4])
As you can see, all values from 5 to 30 were replaced by a 1, values from 35 to 60 were replaced by a 2, and so on. Now, let me take you through how I constructed the final dataframe, just for fun:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Put the re-encoded target into its own column with a clearer name
df = pd.DataFrame(a)
df.rename(columns={'Harry_Potter_and_the_Forbidden': 'HP_Forbidden_clean'}, inplace=True)
hp_bis = pd.concat([hp, df], axis=1)
hp = hp_bis.drop('Harry_Potter_and_the_Forbidden', axis=1)

# Train the model with the new encoding (X_train_Smote, y_train_Smote, X_test
# and y_test come from the split and SMOTE steps of the previous posts)
logisticRegr = LogisticRegression(max_iter=20000)
logisticRegr.fit(X_train_Smote, y_train_Smote)
y_pred = logisticRegr.predict(X_test)
print(classification_report(y_test, y_pred))
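As a side note, the same four-group binning could be written more compactly with pandas' cut function. This is just an equivalent sketch, run on the dataframe before the original column is dropped, not what the notebook actually uses:

# (0, 30] -> 1, (30, 60] -> 2, (60, 120] -> 3, (120, inf) -> 4
hp['HP_Forbidden_clean'] = pd.cut(
    hp.Harry_Potter_and_the_Forbidden,
    bins=[0, 30, 60, 120, float('inf')],
    labels=[1, 2, 3, 4]).astype(int)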

It's alive! It's alive! Did you notice that we increased the accuracy from 30% all the way to 50%?! Quite good, isn't it? This approach has its pros and cons, though. One con is that the model's predictions will be less useful in a sense: we are no longer predicting an exact time, but a range of times.
In this scenario, I used all the variables, even though, according to our previous post, the MRMR analysis said the model would be optimized with only 3 variables. That is no longer the case, so I had to re-run the tests and actually got different results, nothing to be alarmed about. I also ran a final test where, instead of reducing the target variable to 4 classes, I reduced it to 3. The ranges were (a small encoding sketch follows the list):
- From 5 to 30 mins
- From 30 mins to 60 mins
- More than 60 mins
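Here is a minimal sketch of that 3-class encoding, assuming we start again from the dataframe before the 4-class re-encoding (HP_Forbidden_3cls is a hypothetical column name, not one from the notebook):

# Up to 30 min -> 1, from 30 to 60 min -> 2, more than 60 min -> 3
t = hp.Harry_Potter_and_the_Forbidden
hp['HP_Forbidden_3cls'] = t.map(lambda m: 1 if m <= 30 else (2 if m <= 60 else 3))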
For the final analysis, I’ll present a graph with the metrics and see which configuration performs best; then we will make a decision and move on.

As seen in the graph above, when we reduce the classes from 4 to 3, the accuracy improves. In conclusion, we will stick with 3 classes for the target variable in our analysis.
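For reference, a comparison graph like the one above can be produced along these lines. This is a sketch with hypothetical variable names: the true and predicted labels of each experiment have to come from that experiment's own run, they are not hard-coded here.

import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score

# Hypothetical: y_test_4/y_pred_4 and y_test_3/y_pred_3 hold the true and
# predicted labels from the 4-class and 3-class experiments, respectively
scores = {
    '4 classes': accuracy_score(y_test_4, y_pred_4),
    '3 classes': accuracy_score(y_test_3, y_pred_3),
}
plt.bar(scores.keys(), scores.values())
plt.ylabel('Accuracy')
plt.title('Accuracy by number of target classes')
plt.show()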
That is all for this post. In the next posts, we will use cross-validation techniques and different ML algorithms to create more complex, and hopefully better, models.
Thank you so much for reading and I hope this has been helpful.
Google Colab link: https://colab.research.google.com/drive/1LM5tbdlHh6Txs0Cyc8IjLi72pyMQ3DnA?usp=sharing