
Part 3: Getting used to data

  • Writer: Rodrigo Ledesma
  • May 15, 2022
  • 5 min read

Updated: May 19, 2022

Thank you if you have read my last two posts; I hope you enjoyed them and learned something from them. Now that I have shown how I gathered all my information, it is time to manipulate and prepare our data to be ingested by the ML algorithms.


Please feel free to visit my web page and download the dataset. I have been running the script described in Part 2 for almost a year, so I have accurate information and the time period is long enough for an analysis.


You can also take a look at and download the Jupyter notebook I created for this purpose. Later I will decide whether or not to create a Google Colab file; it might be easier for sharing.



Data cleansing and data preparation

I will describe each step with very basic concepts and easy code, so if you are an advanced user, please feel free to skip some areas of the article. My first step was to open a Jupyter Notebook and load the .xlsx file. Here is an image of how the raw dataframe looks:




I will start by analyzing the second tab of the Excel file, as it is the easiest and cleanest one. When looking at the types of variables our dataFrame contains, we need to pay special attention to date, hour, and report: they are objects (strings), and we will have to process them before we can do any analysis.
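For reference, here is a minimal sketch of loading that sheet and checking the column types (the file name is a placeholder for the dataset you can download from my page):

import pandas as pd

# the file name is a placeholder; sheet_name=1 loads the second tab
uni_raw_df = pd.read_excel('universal_wait_times.xlsx', sheet_name=1)
print(uni_raw_df.dtypes)  # Date, hour, and report show up as object (string)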



Let’s start by analyzing Date:


As all dates have the same format, it is relatively easy to get the month, day, and year individually. First we will split the string on the "/" special character, then cast each column to an integer, and finally rename the columns of the resulting pandas dataFrame.


# split the mm/dd/yyyy strings into three separate columns
date = uni_raw_df['Date'].str.split(pat='/', expand=True)
date.columns = ['month', 'day', 'year']  # rename the variables in the new df
date = date.astype('int')


Now we have each part of the date ready to be analyzed as an integer. With this code we finish processing the first variable, and we can move on to the hour variable. Luckily for us, this variable has the format hh:mm, so we only have to split the string on the colon and cast to integers.


# split the hh:mm strings into separate hour and minute columns
hour = uni_raw_df['hour'].str.split(pat=':', expand=True)
hour.columns = ['hour', 'minute']
hour = hour.astype('int')


Treating categorical values

A machine learning model such as a regression or a neural network is not able to analyze numerical data together with characters or strings, so we need to assign a numerical value to each category. The report variable is an excellent example. Think of it as your favorite TV weather reporter telling you every 10 minutes what the weather is like, using words such as "cloudy", "clear sky", or "heavy rain".

To encode these categories, sklearn gives us different alternatives. If we use the famous OneHotEncoding, we will get n boolean variables, where n is the number of categories; this might increase the number of variables in our analysis too much. The second option is OrdinalEncoding, where the script assigns a number to each category. As we are not able to choose the numbers, this technique will not make much sense, since light rain could get a value of 1 and scattered clouds a value of 8. So it might be better to assign the values manually. Anyhow, we will take all three approaches and discuss later which one brings the best results at the end of the day.


One Hot Encoding


There are a lot of blogs that will give you immense amounts of information on how OneHotEncoding works, so I will not go deeper into the topic and will go straight to my case. I have 17 categories in the report variable, and with one-hot encoding I will make an independent boolean variable for each. Here is how:


from sklearn.preprocessing import OneHotEncoder
import pandas as pd

oHe = OneHotEncoder(handle_unknown='ignore')
oHe_report = pd.DataFrame(oHe.fit_transform(uni_raw_df[['report']]).toarray())
# caution: the encoder orders its output columns alphabetically by category,
# so check oHe.categories_[0] before naming the columns by hand
oHe_report.columns = ['heavy intensity rain', 'light rain', 'broken clouds',
       'moderate rain', 'mist', 'overcast clouds', 'clear sky',
       'scattered clouds', 'thunderstorm with rain', 'few clouds',
       'thunderstorm', 'shower rain', 'very heavy rain', 'fog', 'haze',
       'thunderstorm with light rain', 'light intensity drizzle']
oHe_report = oHe_report.astype('int')

Analyzing it line by line: we import the class from sklearn and create an oHe object. The next line creates a pandas dataframe with the transformed data from the report variable in uni_raw_df. The last two lines are pretty simple: they rename the columns of the newly created dataFrame and cast its elements to integers. Technically very easy; let's take a look at the result:
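If you are following along in the notebook, you can inspect the new dataframe directly:

print(oHe_report.head())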


As described before, we get an independent variable for each state: a row has a 1 in the new variable that matches the original report value and a 0 in all the other new variables.


Ordinal Encoding


Instead of turning the strings into boolean values, OrdinalEncoding assigns each category a natural number from 0 to n-1, where n is the number of states the categorical variable contains. The code is pretty simple too, so let's take a look at it:


from sklearn.preprocessing import OrdinalEncoder

# assign each report category an integer code from 0 to n-1
ord_enco = OrdinalEncoder()
oE_report = ord_enco.fit_transform(uni_raw_df[['report']])
oE_report_df = pd.DataFrame(oE_report).astype('int')
oE_report_df.columns = ['report']

As with one-hot encoding, we first import the class and create an object, then fit it and transform the report variable. Finally, we cast the datatype to integer and rename the column.
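If you want to see which number was assigned to which report, the fitted encoder exposes the learned categories (the position in the list is the assigned code):

# categories_[0] lists the report strings in the order of their integer codes
print(dict(enumerate(ord_enco.categories_[0])))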


Manual Encoding


As mentioned before, Ordinal Encoding is not the best option, as there is no logic behind how the numerical values are assigned, so we will do that step manually. For this, I will base my logic on the fact that the worse the report (wind, rain, and thunderstorms), the higher the numerical value will be. This is the idea behind the encoding I will use:
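As a minimal sketch, a plain Python dictionary plus pandas' map does the whole job (the severity values below are my own illustrative assumptions, ordered roughly from calm to stormy):

# hypothetical severity values: calmer weather gets lower numbers
report_map = {'clear sky': 1, 'few clouds': 2, 'scattered clouds': 3,
              'broken clouds': 4, 'overcast clouds': 5, 'mist': 6,
              'haze': 7, 'fog': 8, 'light intensity drizzle': 9,
              'light rain': 10, 'shower rain': 11, 'moderate rain': 12,
              'heavy intensity rain': 13, 'very heavy rain': 14,
              'thunderstorm with light rain': 15, 'thunderstorm': 16,
              'thunderstorm with rain': 17}

man_report_df = uni_raw_df['report'].map(report_map).to_frame('report')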



And actually, this is the only code needed for the transformation, so technically we are done now: we have a different dataFrame for each type of encoding.


Create the final dataframe and clean unwanted values

I will start by analyzing the distributions for only the first ride, the Harry Potter ride we have been working with so far. My first step was to take a look at all the unique values and, as expected, there are some special values we need to consider first.
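In pandas this check is a one-liner:

# list every distinct recorded value, including 'Closed' and NaN
print(uni_raw_df['Harry Potter and the Forbidden...'].unique())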



For example, we have the string "Closed" and null values. So before plotting any diagram, let's handle them and get rid of the useless data. Even though Universal's data is cleaner than Disney's, it still contains strings and NaN. To handle this, I will delete all rows that contain NaN and assign "Closed" a big number to try to make a correlation.


rideHP1=pd.DataFrame(uni_raw_df['Harry Potter and the Forbidden...'].replace({'Closed':200}))


This line of code creates a new dataFrame which replaces all the "Closed" string values with the numerical value 200. But our job is not done yet, because we still have empty values. To take care of them, I had to look for them manually, and it turns out they were at the end of the document, so I simply got rid of them by dropping the rows.
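Since the empty values all sat at the end of the dataframe, a simple drop is enough; a minimal sketch:

# drop the remaining rows with missing waiting times
rideHP1 = rideHP1.dropna()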


Then I made a simple histogram of the values, and it was very impressive to notice that most of the time this ride has recorded waiting times below 30 minutes. A good sign.
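For reference, the plot is a one-liner with pandas' built-in matplotlib wrapper (the bin count is my own choice):

import matplotlib.pyplot as plt

# quick look at the distribution of recorded waiting times
rideHP1.hist(bins=20)
plt.xlabel('waiting time (min)')
plt.ylabel('frequency')
plt.show()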




Now I will append all the new variables we have created so far into a new dataframe and then clean the outliers, which in this case are the 200 values representing times when the ride was closed.
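As a minimal sketch, the concatenation and filtering could look like this (I assume here that we keep the one-hot encoded report; the ordinal or manual dataframe would be swapped in the same way):

# stitch the engineered features and the target together, aligned on the index;
# dropna() also discards the rows we removed from rideHP1 earlier
full_df = pd.concat([date, hour, oHe_report, rideHP1], axis=1).dropna()

# remove the outliers: rows where the ride was closed (encoded as 200)
ride_col = 'Harry Potter and the Forbidden...'
full_df = full_df[full_df[ride_col] != 200]

# everything except the waiting time column is an independent variable
independent_vars = full_df.drop(columns=[ride_col])

The next step will be to normalize the concatenated dataFrame with the following code: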


from sklearn.preprocessing import MinMaxScaler

# scale every independent variable to the [0, 1] range
scaler = MinMaxScaler()
independent_vars_norm = pd.DataFrame(scaler.fit_transform(independent_vars),
                                     columns=independent_vars.columns)

Data normalization or standardization (whichever you choose) is extremely important, as having all values in the same range (from 0 to 1) allows the algorithm to train and converge on a model much more easily.



Now that we have our dataFrame ready, it is time to start creating our first ML model. So stay tuned for part 4.




 
 
 
