
Part 10: How to treat imbalanced datasets while doing Cross-Validation

  • Writer: Rodrigo Ledesma
  • Jun 17, 2022
  • 5 min read

Hi there! Welcome back to another episode of FUN WITH FLAGS! Just kidding, I hope you got The Big Bang Theory joke. In our last post, we had some fun analyzing how the number of classes in the target variable impacts our model's metrics. In this post, I will show you how to correctly combine imbalanced datasets with cross-validation, specifically the k-folds technique.


If this is your first visit to my blog: the purpose of this series of articles is to build a Machine Learning model that predicts how long an average visitor will have to wait in line at a Disney or Universal park before riding their favorite rollercoaster, and, with this information, to optimize their visit by scheduling each park on the day it is less crowded.


The conclusion of the last post was that when we reduced the number of classes to 4 and then to 3, the accuracy and the general metrics improved. This is not necessarily a good thing, since we lose information and precision in our predictions: instead of predicting a specific time, we are now predicting a range. Now let's move on to a delicate topic: cross-validation.


Cross-Validation

This technique is very simple to understand and also to implement. Imagine your data is split and stored in 10 different drawers. What cross-validation does is resample our data and train a model with different parts of it to avoid overfitting. Coming back to our example, imagine that we use the data inside drawer 1 to test and the data from drawers 2 to 10 to train, then evaluate our model for the first time. Next, we train with the data from drawers 1 and 3–10, and test the model with the data from drawer 2. And so on, until every drawer has been used for testing.
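To make the drawer rotation concrete, here is a minimal sketch using scikit-learn's KFold on a tiny made-up dataset (the toy variable names and data are just for illustration, not part of the waiting-times project):

from sklearn.model_selection import KFold
import numpy as np

X_toy = np.arange(20).reshape(10, 2)  # pretend each row is one "drawer" of data
y_toy = np.arange(10)

kf = KFold(n_splits=10)
for fold, (train_idx, test_idx) in enumerate(kf.split(X_toy), start=1):
    # nine drawers are used for training, the remaining one is held out for testing
    print(f"Fold {fold}: train rows {train_idx}, test row {test_idx}")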


Quite easy, isn't it? Just as easy as winning a game against the Ottawa Senators. Well, let's see how our data behaves with cross-validation:


Using Cross-validation with our Harry Potter waiting times data, the incorrect way

The first step is to read the CSV file into a dataframe, but not just any CSV: the one we used in our last post, which has only 3 classes in the target variable:



hp = pd.read_csv('HP_OHE_3class.csv')
hp = hp.drop('Unnamed: 0',axis=1)

Now let's split the data into train and test sets and oversample the training data:

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

def getXandY(df):
    df.drop(df.tail(20).index, inplace=True)  # drop the last 20 rows
    x = df.drop(['HP_Forbidden_clean'], axis=1)  # features
    y = df.HP_Forbidden_clean                    # target variable
    return (x, y)

def trainTest(x, y):
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.30, shuffle=True)
    return (X_train, X_test, y_train, y_test)

smote = SMOTE(random_state=42)
X, Y = getXandY(hp)
X_train, X_test, y_train, y_test = trainTest(X, Y)
X_train_Smote, y_train_Smote = smote.fit_resample(X_train, y_train)
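To sanity-check the oversampling, you can compare the class counts before and after SMOTE; every class should end up with the same number of rows as the majority class (this assumes y_train and y_train_Smote are pandas Series, as in the code above):

# class distribution before oversampling (imbalanced)
print(y_train.value_counts())
# class distribution after SMOTE (all classes should now have the same count)
print(y_train_Smote.value_counts())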

Finally, let’s use only the variables that got the best result with the MRMR technique:

mrmrO = ['month', 'day', 'year', 'hour', 'minute', 'holiday', 'dayOfTheWeek', 'temperature', 'humidity', 
         'pressure', 'heavy intensity rain', 'light rain', 'broken clouds', 'scattered clouds', 
         'thunderstorm with rain', 'few clouds', 'thunderstorm', 'shower rain', 'heavy intensity rain', 
         'mist', 'scattered clouds']
X_train_Smot_r = X_train_Smote[mrmrO]
X_test_r = X_test[mrmrO]

We are ready to train and test our model! But before that, let's define the parameters of the cross-validation technique:

from sklearn.model_selection import cross_val_score

logisticRegr = LogisticRegression(max_iter=20000)
scores = cross_val_score(logisticRegr, X_train_Smot_r, y_train_Smote, cv=5)  # mistake: cross-validating on the upsampled data

This score will generally turn out to be great, much better than the one we obtained originally. But it is incorrect: if you look closely, you are splitting the upsampled data into training and testing folds, and as you may remember, testing on upsampled data is not allowed; only the training data must be balanced. But how, then, are we going to use the cross-validation function if we are not able to split our data beforehand?


Solving the problem of imbalanced datasets and cross-validation

Well, unfortunately there is no sklearn function that treats training and test sets individually and splits them into folds for us. But that does not mean we cannot succeed in our task. The easiest solution is to create our own function, and if you are facing this same problem and do not have time to code it, please feel free to copy-paste the one I have created. Let's see step by step what the function does; at the end I will give the complete code:


1 First we need to define some base variables:


import math

k = 5               # number of folds our df will be divided into
a = len(X)          # length of the complete dataframe
n = math.floor(a/k) # number of elements (rows) in each fold
# the loop that follows will iterate over the fold numbers i = 1, 2, ..., k

2 Now let's create our training and test datasets, but we will create them according to the current iteration. If we are in the first or the last fold, the test set is simply the first or the last chunk of rows, so I handled those cases with a simple if statement:


if i == 1:
  xtrain_fold = X.iloc[n:]
  ytrain_fold = y.iloc[n:]
  xtest_fold = X.iloc[:n]
  ytest_fold = y.iloc[:n]
elif i == k:
  xtrain_fold = X.iloc[:(i-1)*n]
  ytrain_fold = y.iloc[:(i-1)*n]
  xtest_fold = X.iloc[(i-1)*n:]
  ytest_fold = y.iloc[(i-1)*n:]
else:
  xtrain1_fold = X.iloc[:(i-1)*n,:]
  xtrain2_fold = X.iloc[i*n:,:]
  xtrain_fold = pd.concat([xtrain1_fold, xtrain2_fold], axis=0)
  ytrain1_fold = y.iloc[:(i-1)*n]
  ytrain2_fold = y.iloc[i*n:]
  ytrain_fold = pd.concat([ytrain1_fold, ytrain2_fold], axis=0)
  xtest_fold = X.iloc[(i-1)*n:i*n]
  ytest_fold = y.iloc[(i-1)*n:i*n]

Let's make a quick analysis of what is happening in the first if. Say, for example, that our dataset consists of 50 rows and we are creating 5 folds; then:

  • n = 10

  • k = 5

  • a = 50

With this in mind, the training set for the first fold will have all rows from 10 to 50, and the test set will have the rows from the beginning up to 10 (1–10). If we are on the last fold, the training set will have all rows from the beginning up to the start of the last fold (1–40) and the test set will be rows 40–50. If we are in neither the first nor the last fold, the training data is the concatenation of everything before the fold (from the beginning up to row (i-1)*n) and everything after it (from row i*n to the end), while the test set is the untouched range in between, from row (i-1)*n to row i*n.
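If you want to double-check that index arithmetic, a small throwaway snippet like the one below (purely illustrative, using the same toy numbers: 50 rows and 5 folds) prints the row ranges each fold would use:

import math

a, k = 50, 5            # 50 rows, 5 folds, as in the example above
n = math.floor(a / k)   # 10 rows per fold

for i in range(1, k + 1):
    test_start, test_end = (i - 1) * n, i * n
    print(f"Fold {i}: test rows [{test_start}, {test_end}), "
          f"train rows [0, {test_start}) and [{test_end}, {a})")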


3 Upsample the training dataset, leave the test untouched and train our model:


# Upsample only the data in the training section
xtrain_fold_upsample, ytrain_fold_upsample = smoter.fit_resample(xtrain_fold,ytrain_fold)
# Fit the model on the upsampled training data
model_obj = logisticRegr.fit(xtrain_fold_upsample, ytrain_fold_upsample)
# Score the model on the (non-upsampled) validation data
score = accuracy_score(ytest_fold, model_obj.predict(xtest_fold))

This process will give you k different models, each with its own performance metric (I used accuracy in this case), and at the end I simply asked the function to print the mean value. Here is the complete code:

from sklearn.metrics import recall_score, accuracy_score
from statistics import mean
import math

smoter = SMOTE(random_state=42)
scores = []

def manualKFolds(X, y, k):
    a = len(X)           # length of the complete dataframe
    n = math.floor(a/k)  # number of rows in each fold
    for i in range(1, k+1):  # fold numbers go from 1 to k
        if i == 1:
            xtrain_fold = X.iloc[n:]
            ytrain_fold = y.iloc[n:]
            xtest_fold = X.iloc[:n]
            ytest_fold = y.iloc[:n]
        elif i == k:
            xtrain_fold = X.iloc[:(i-1)*n]
            ytrain_fold = y.iloc[:(i-1)*n]
            xtest_fold = X.iloc[(i-1)*n:]
            ytest_fold = y.iloc[(i-1)*n:]
        else:
            xtrain1_fold = X.iloc[:(i-1)*n,:]
            xtrain2_fold = X.iloc[i*n:,:]
            xtrain_fold = pd.concat([xtrain1_fold, xtrain2_fold], axis=0)
            ytrain1_fold = y.iloc[:(i-1)*n]
            ytrain2_fold = y.iloc[i*n:]
            ytrain_fold = pd.concat([ytrain1_fold, ytrain2_fold], axis=0)
            xtest_fold = X.iloc[(i-1)*n:i*n]
            ytest_fold = y.iloc[(i-1)*n:i*n]

        # Drop a leftover 'index' column, if there is one
        try:
            xtrain_fold = xtrain_fold.drop('index', axis=1)
            ytrain_fold = ytrain_fold.drop('index', axis=1)
            xtest_fold = xtest_fold.drop('index', axis=1)
            ytest_fold = ytest_fold.drop('index', axis=1)
        except Exception:
            pass

        # Upsample only the data in the training section
        xtrain_fold_upsample, ytrain_fold_upsample = smoter.fit_resample(xtrain_fold, ytrain_fold)
        # Fit the model on the upsampled training data
        model_obj = logisticRegr.fit(xtrain_fold_upsample, ytrain_fold_upsample)
        # Score the model on the (non-upsampled) validation data
        score = accuracy_score(ytest_fold, model_obj.predict(xtest_fold))
        print(score)
        if i > 1:
            # collect the score (the first fold is left out of the average)
            scores.append(score)
    print('Mean accuracy of the model: {}'.format(mean(scores)))
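A quick usage sketch (purely illustrative): X and Y are the full feature matrix and target returned by getXandY earlier in the post, and you could just as well pass the MRMR-reduced columns instead:

# Run the manual 5-fold cross-validation; SMOTE is applied only inside each training fold
manualKFolds(X, Y, k=5)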

My dataset is not very big and the scores I obtained are quite dispersed, but in the end I ended up with a realistic accuracy of between 50 and 53%.

Please feel free to use the code in my Google Colab if you find it beneficial for your projects or research.



As always, thank you for reading, and I hope this has been helpful for you. In the following posts, we will use different ML algorithms on our data and compare the results; we will also analyze different methods, such as Grid Search Cross-Validation, for parameter tuning.






