Hi again! Welcome to my newest post. Here we will be starting to do data science and some machine learning.
If you have been reading my posts (thank you!), by now you know that this series is oriented toward predicting how long an average visitor will wait in line before riding their favorite attraction at Walt Disney World or Universal Studios. In my last post, I decided to narrow down the rides and start only with Harry Potter and the Forbidden Journey, since it is my favorite ride (a purely personal choice). Also in my last post, we normalized the data, treated categorical values, and turned strings into numerical values as preprocessing. Now we are almost ready to train a model.
Feature selection techniques
The objective of feature selection is to figure out which variables, or features, are the most relevant for our analysis. It is not an easy task and, like most things in machine learning, there is no single correct way to do it, so we will need to run several experiments before we can reach a conclusion.
There are different families of techniques, and I will be analyzing the performance of each one. We will be using different types of correlation, variance threshold, mutual information, and Maximum Relevance, Minimum Redundancy (MRMR). So let's start with the easiest technique: correlation.
Filtering Techniques: Correlation
Filtering techniques base their analysis on statistical principles and assign a numerical value to each variable or feature. Based on that number, a second step discriminates between features to “filter” out only the most relevant variables of the whole dataset.
Correlation
Let's start with the most popular type of correlation, the Pearson correlation. Generally speaking, the purpose of analyzing correlation is to find the relationship between two variables. The analysis outputs a value between -1 and 1. If the value is close to 1 or -1, an increase or decrease in the independent variable directly affects the behaviour of the dependent variable. This method finds only linear dependencies, and this is its formula:
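Written out in the standard textbook form (for two variables x and y with n observations each), the coefficient is:

r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^{2}}\;\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^{2}}}

The two square roots in the denominator are the s_x and s_y terms described below.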
It looks complicated but it is actually pretty straightforward. x_i indicates the individual value analyzed in each iteration of the summation, while x-bar represents the mean of all x values. s_x is built from the sum of squared deviations of x (it is the square root of that sum), and the same applies for y. Finally, n is the number of elements in the x array, which must be equal to the number of elements in y.
One famous alternative to Pearson is the Kendall correlation, which uses pairs of elements to find concordances or discordances. A pair is concordant if (x2 - x1) and (y2 - y1) have the same sign, and discordant if (x2 - x1) and (y2 - y1) have opposite signs. Pearson looks for linear relationships between two variables, while Kendall, being rank-based, captures monotonic relationships, and we will be using them both for our analysis.
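For reference, the simplest (tau-a) version of Kendall's coefficient just compares the number of concordant pairs C against the number of discordant pairs D over the total number of pairs:

\tau = \frac{C - D}{n(n-1)/2}

(pandas' .corr(method='kendall') actually computes a tie-corrected variant, tau-b, but the idea is the same.)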
Enough math, let's get into the coding part. Let's define a function that, given a DataFrame, will output the correlation of each independent variable with respect to the dependent variable, plus a heatmap to make everything more visual.
import seaborn as sns
import matplotlib.pyplot as plt

def correlation(df, title):
    # Drop the trailing rows with null values and the leftover index column
    df.drop(df.tail(17).index, inplace=True)
    df = df.drop(['Unnamed: 0'], axis=1)
    # Correlation matrices with both methods
    pearson_corr = df.corr(method='pearson')
    kendall_corr = df.corr(method='kendall')
    # Heatmaps to visualize both matrices
    sns.heatmap(pearson_corr)
    print(title)
    plt.title("Pearson Correlation")
    plt.show()
    sns.heatmap(kendall_corr)
    plt.title("Kendall Correlation")
    plt.show()
    # Keep only the column relating every feature to the ride's waiting time
    hpPear = pearson_corr['Harry_Potter_and_the_Forbidden']
    print('Correlation with Pearson method')
    print(hpPear.sort_values(ascending=False))
    hpKen = kendall_corr['Harry_Potter_and_the_Forbidden']
    print("")
    print('Correlation with Kendall method')
    print(hpKen.sort_values(ascending=False))
Easy! Just like winning a hockey match against the Senators! Well, let's take a closer look at each line. I made the function expect a DataFrame and also a title, because I want to print which type of encoding I am using (OhE, OE…). Based on my DataFrame containing the Harry Potter waiting times and the weather variables, I first need to get rid of the null values and of one unnamed column that was created when I transformed my object into a pandas DataFrame. Next, I create two objects using pandas' built-in .corr() function, one with the Pearson method and another with Kendall, each one holding the full correlation matrix. Seaborn (sns) is a library that lets the user create charts and maps to visually analyze information, and in this case I am using it to draw the heatmap of both correlations. The last lines keep only the Harry Potter column, which contains the relation of this variable with respect to the weather variables; the rest of the matrix is irrelevant for us.
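As a quick illustration of how I call the function (the DataFrame name here is just a placeholder; in my notebook each encoding has its own DataFrame):

# Hypothetical call: df_ohe would be the one-hot-encoded version of the dataset.
# .copy() keeps the original DataFrame intact, since the function drops rows in place.
correlation(df_ohe.copy(), 'One Hot Encoding')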
So let's analyze the results step by step; first, let's see the heatmap of the Pearson correlation:
A quick description: each box represents the correlation between two variables and its color encodes the resulting value. The diagonal is white because it compares each variable with itself, and we can also see that, for this analysis, the Pandemic feature is actually useless (it produces no correlation values at all).
Now let's take a look at the correlation results only for the HP ride:
Correlation with Pearson method
Harry_Potter_and_the_Forbidden 1.000000
temperature 0.178418
holiday 0.159011
day 0.105527
month 0.097152
pressure 0.090793
day.1 0.089410
report 0.012176
minute 0.007772
year -0.028160
humidity -0.128659
hour -0.196399
Pandemic NaN
Name: Harry_Potter_and_the_Forbidden, dtype: float64
Correlation with Kendall method
Harry_Potter_and_the_Forbidden 1.000000
temperature 0.139885
holiday 0.102868
month 0.092505
day 0.089228
day.1 0.071534
pressure 0.070398
report 0.021043
year 0.012482
minute 0.006215
humidity -0.122661
hour -0.168177
Pandemic NaN
Name: Harry_Potter_and_the_Forbidden, dtype: float64
As you can appreciate, the order and the values of the features change when we apply different methods of correlation, so it is not a bad idea to test both and see which one gives the best results. Digging deeper, we can see that temperature is the most relevant feature, even though its correlation coefficient is smaller than 0.2. So far none of the features has a strong correlation with the ride's waiting time, but based on this analysis we can start to filter some features, as the sketch below shows.
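For example, a minimal sketch of that filtering step, assuming the Pearson series from the function above were returned instead of only printed (I call it hp_pearson here, and the 0.05 cutoff is completely arbitrary):

# Hypothetical filtering step: keep features whose absolute correlation with the
# target is above an arbitrary cutoff
cutoff = 0.05
scores = hp_pearson.drop('Harry_Potter_and_the_Forbidden')  # remove the target itself
selected = scores[scores.abs() > cutoff].index.tolist()
print(selected)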
Mutual Information Technique
This is a statistical method based on the Kullback–Leibler divergence. Its mathematical explanation is complicated, so I will not go deep into it, but generally speaking it calculates the dependency of each independent variable with respect to the dependent variable. It is important to take this technique into consideration because, unlike correlation, this calculation captures any kind of relationship between features, not only linear dependencies.
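For the curious, the definition for two discrete variables X and Y looks like this (it measures how far the joint distribution is from the product of the marginals, which is exactly the Kullback–Leibler divergence mentioned above):

I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}

If X and Y are independent, p(x,y) = p(x)p(y) and the whole sum is zero; the more one variable tells us about the other, the larger the value gets.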
Luckily for us, sklearn has a module called feature_selection that includes mutual information functions, both for classification and for regression. Right now I will be using both of them, and later on I will compare how they perform. You can see the testing and the step-by-step process in my Google Colab file, where you can also download all my datasets. I found two different ways of applying these functions, so I will be testing out three variants (mutual info classification 1, mutual info regression, and mutual info classification 2).
from sklearn.feature_selection import SelectKBest, SelectPercentile, mutual_info_classif, mutual_info_regression

def mutualInfo(df, k):
    # Split into independent variables (x) and the dependent variable (y)
    x = df.drop(['Harry_Potter_and_the_Forbidden', 'Unnamed: 0'], axis=1)
    y = df['Harry_Potter_and_the_Forbidden']
    # classification
    selector = SelectKBest(mutual_info_classif, k=k)
    X_reduced = selector.fit_transform(x, y)
    cols = selector.get_support(indices=True)
    selected_columns = x.iloc[:, cols].columns.tolist()
    print('Mutual information for classification--------------')
    print(selected_columns)
    # regression
    selector = SelectKBest(mutual_info_regression, k=k)
    X_reduced = selector.fit_transform(x, y)
    cols = selector.get_support(indices=True)
    selected_columns = x.iloc[:, cols].columns.tolist()
    print('Mutual information for regression------------------')
    print(selected_columns)
    # second code with classification: score every feature and keep the top k
    threshold = k  # the number of most relevant features
    high_score_features = []
    feature_scores = mutual_info_classif(x, y, random_state=0)
    print('Second code for mutual information for classification------------------')
    for score, f_name in sorted(zip(feature_scores, x.columns), reverse=True)[:threshold]:
        print(f_name, score)
        high_score_features.append(f_name)
    HP_MI = x[high_score_features]
Let's analyze the code step by step. First I do a little pre-processing of the information: I delete some useless variables and split my data into independent variables (x) and the dependent variable (y). Then I create an object that contains the result of a helper called SelectKBest; this is a separate feature-filtering method, but since on its own (with its default scoring function) it only captures linear dependencies, I will not use it by itself. You can see the variable k is present, and it is the number of features we want to keep; some posts will use only 50% of the variables, or a certain percentile, but since I am not sure how many important features there are, I will ask the function to analyze all of them. As you can see, the regression code and the classification code are nearly identical, so let's jump into the third method.
Here we use only the mutual_info_classif function, and as you can see we then have a for loop which prints each feature and its numerical score, so let's take a look at the results of the three methods.
Mutual information for classification--------------
['month', 'day', 'year', 'hour', 'minute', 'holiday', 'day.1', 'Pandemic', 'temperature', 'humidity', 'pressure', 'report']
Mutual information for regression------------------
['month', 'day', 'year', 'hour', 'minute', 'holiday', 'day.1', 'Pandemic', 'temperature', 'humidity', 'pressure', 'report']
Second code for mutual information for classification------------------
day 0.1850
temperature 0.1828
month 0.1676
humidity 0.1568
hour 0.1492
day.1 0.0899
pressure 0.08935
year 0.01867
holiday 0.01862
report 0.0099
minute 0.0049
Pandemic 0.0009
As you can appreciate, using the mutual_info functions together with SelectKBest brings the same result whether we use the classification or the regression version. The interesting part comes when we analyze the second code, as here the order of the variables changes and we can now see that the numerical scores can help us discriminate between features. One good thing is that all our methods agree that the Pandemic feature is useless so far, and also that day is relatively important.
Variance Threshold
This is a very easy and practical method used for filtering. It is based on the statistical variance, which measures how far a set of numbers is spread out from its average value, or its dispersion. The code we will be working with runs that analysis and keeps only the features whose variance is bigger than a certain threshold.
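Before the code, a quick refresher: the (population) variance of a feature x with n observations and mean x-bar is

\mathrm{Var}(x) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^{2}

A feature whose variance is close to zero barely changes across the dataset, so it can hardly explain changes in the waiting time; that is the whole intuition behind dropping it. With that in mind, let's take a look at the code.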
from sklearn.feature_selection import VarianceThreshold

def varianceTh(df):
    # Same pre-processing as before: split features (x) from the target (y)
    x = df.drop(['Harry_Potter_and_the_Forbidden', 'Unnamed: 0'], axis=1)
    y = df['Harry_Potter_and_the_Forbidden']
    selector = VarianceThreshold(threshold=0.01)  # variance threshold
    sel = selector.fit(x)
    sel_index = sel.get_support()
    # Keep only the columns whose variance is above the threshold
    HP_VT = x.iloc[:, sel_index]
    print(HP_VT.columns)
The code is easy and straightforward. I used a 0.01 threshold, and later on I only use the selected indices to print the names of the surviving features. The result of running the function is the following:
['month', 'day', 'year', 'hour', 'minute', 'holiday', 'day.1', 'temperature', 'humidity', 'pressure', 'report']
One more time we can see that day is important, and so is month, while the Pandemic feature does not appear at all because its variance is smaller than 0.01; in other words, irrelevant.
Maximum Relevance — Minimum Redundancy
This is a very interesting method that has been widely studied and accepted in the academic community. There are several papers that analyze the benefits of reducing the number of features with this technique. The basis of this approach is that it looks for a minimal-optimal subset: if we have two features that are both relevant but redundant with each other, a normal algorithm will recommend using both, while MRMR will discard one of them, because they bring approximately the same information to the analysis.
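One common way of writing the greedy selection rule (this is just one formulation among several in the literature, not necessarily the exact one the library below uses) is: at every step, pick the feature f that maximizes its relevance to the target y minus its average redundancy with the already selected set S,

f^{*} = \arg\max_{f \notin S}\left[\, I(f;y) - \frac{1}{|S|}\sum_{s \in S} I(f;s) \,\right]

where I(\cdot;\cdot) is the mutual information we met earlier, so features that are relevant but redundant with what we already picked get penalized.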
As of the day I am writing this post, sklearn has not yet implemented this function in its library, but there is an alternative package.
import pandas as pd
from mrmr import mrmr_classif
from sklearn.datasets import make_classification

def mrmr(df, k):
    # Same pre-processing: features (x) and target (y)
    x = df.drop(['Harry_Potter_and_the_Forbidden', 'Unnamed: 0'], axis=1)
    y = df['Harry_Potter_and_the_Forbidden']
    Y = pd.Series(y)
    # use mrmr classification
    selected_features = mrmr_classif(x, Y, K=k)
    print(selected_features)
The code is extremely easy; the actual selection happens in one single line, and the rest is only data pre-processing. In this case, I asked the function to return all the features in order of importance, and this was the result.
['temperature', 'hour', 'holiday', 'humidity', 'day', 'year', 'day.1', 'month', 'pressure', 'minute', 'report']
Now comes the tedious part. In part 3 I used different encodings for the categorical variable (one-hot encoding, ordinal encoding and manual encoding), so I will run all the filtering methods for the three encodings and post the resulting order of feature relevance, where the higher a variable appears, the more important it is.
Here are the results I have obtained. In my next post I will be running tests and analyzing the performance of models trained on different combinations of these features.
Hope you have enjoyed this part! See you next time!