Hello everybody! Welcome to this new segment of my blog where we will be experimenting with Deep Learning. If you are new to Machine Learning, Deep Learning is a branch of ML that mimics the way humans learn by using artificial neural networks: networks of perceptrons wired into large, interconnected structures that process information and extract features or abstractions, which can then be used to make predictions.
If this is your first visit to my blog, the purpose of this series of articles is to create a Machine Learning model that predicts how long an average visitor will have to wait in line at a Disney or Universal park before riding their favorite rollercoaster, and, with this information, to optimize their visit by scheduling their day so they visit each park when it is less crowded.
For this post, I will be basing my write-up on Dr. Brownlee’s tutorial:
https://machinelearningmastery.com/tensorflow-tutorial-deep-learning-with-tf-keras/
It is an excellent article to start with, built around a very simple example; feel free to go there and complement the information I give here. If you want even more technical explanations, I recommend the book “Hands-On Machine Learning with Scikit-Learn and TensorFlow” by Aurélien Géron; its chapters on deep learning are excellent.
Now you may ask yourselves: OK, if there are all these excellent tutorials, why should I keep reading this guy’s post? Excellent question. Here we will be facing a real problem: our data will not be pretty, nor will it be something simple like the iris dataset. We will describe how to tune the algorithm on a real dataset and solve some of the problems that come with it.
TensorFlow
Just like scikit-learn, TensorFlow is a library that Python can import. It was developed and is maintained by Google, and its magic resides in the fact that you can create deep neural networks with only a few lines of code using its Keras API. There are a lot of different tutorials for installing TensorFlow; please take into consideration that the process differs between Windows and Mac. Personally, I will give you a quick tip: if you are not running long, heavy computations, don’t worry about creating environments and installing things. My advice is to create a Google account and use Google Colaboratory, which lets you use CPUs and GPUs (a limited amount, obviously) for free without any installation.
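For example, a new Colab notebook already comes with TensorFlow installed, so a quick sanity check is enough before we start (a minimal sketch; the exact version you see depends on the runtime):

# TensorFlow ships pre-installed on Google Colaboratory
import tensorflow as tf

print(tf.__version__)
print("GPU available:", bool(tf.config.list_physical_devices("GPU")))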
https://colab.research.google.com/drive/19ER0hSiVRSDOqhqVv-vdOdDAwKqKm1Wg?usp=sharing
Difference between using perceptrons and regular ML models
So far we have used a series of algorithms to create models and make predictions; you can check the previous posts to get an idea of how to use SVMs or regressions. Those algorithms just had to be defined with their parameters and then fit, which was quite simple. Algorithms that use perceptrons have a different life cycle. It is quite similar, but we need to understand it well to set our foundations correctly.
The stages of creating models with perceptrons are:
Define the model: in this phase, we state whether our model will have a sequential or functional architecture, create all the layers, and configure each one of them individually.
Compile the model: this step involves setting parameters such as the optimizer, which can be stochastic gradient descent or Adam; this is the technique used to adjust the weights of the neural network based on its actual performance. Here we also need to choose the loss function, and there are three primary loss functions depending on the type of problem we are tackling:
'binary_crossentropy' for binary classification.
'sparse_categorical_crossentropy' for multi-class classification.
'mse' (mean squared error) for regression.
Fit the model: this is essentially the same as with the other algorithms; we tell the model which set of data it should use to train. Something new is that we also tell the model how many epochs to use, in other words, how many times it should pass over the data. One epoch means the model uses the data once to train. The more epochs, the more times our data is used and, in principle, the better the results, although we have to keep overfitting and training time in mind. The last parameter is the batch size, which is the number of rows processed before each weight update.
Evaluate the model / Make predictions: these two steps are exactly the same as the ones we used with the previous algorithms. A minimal sketch of the whole life cycle follows below.
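To make the life cycle concrete, here is a minimal sketch of the five stages on a toy binary-classification problem (the data and layer sizes are invented purely for illustration, not from our park dataset):

import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# toy data: 200 samples, 4 features, binary target
X = np.random.rand(200, 4)
y = (X.sum(axis=1) > 2).astype(int)

# 1. define the model
model = Sequential()
model.add(Dense(8, activation='relu', input_shape=(4,)))
model.add(Dense(1, activation='sigmoid'))

# 2. compile the model
model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])

# 3. fit the model
model.fit(X, y, epochs=10, batch_size=32, verbose=0)

# 4. evaluate the model
loss, acc = model.evaluate(X, y, verbose=0)

# 5. make predictions
probs = model.predict(X[:5])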
First model: Multilayer Perceptron Model (MLP)
The first model we will construct on this long and exciting path is the MLP, so let’s take a quick look at what this model is.
First let's define what a perceptron is:
Perceptron is another word for an artificial neuron. This representation tries to copy the physical structure of a real biological neuron. The job of a neuron is to activate depending on an external stimulus, and in computing this process was copied with the perceptron. It has several parts: inputs, processing, and output. The perceptron, like the biological neuron, must decide whether or not to activate based on the external stimuli; inside our brain, electrical pulses are received by neurons, which decide whether or not to activate and emit another electrical pulse through the network. A perceptron does a similar thing.
As you can see in the image above, the perceptron is divided into five different areas; let’s take a look at each one of them individually. The first area is where the inputs are located: the inputs are the features, or raw data, in a numerical representation (you can’t feed a neural network strings). Then one of the most important parts of the perceptron comes into play: the weights. As you can see, each input has an associated weight, and a weight is nothing more than a float number that will later be calibrated based on the performance of the predictions. The third part is the net input, or summation zone. Its only job is to add up the products of all inputs times their respective weights. Then comes the activation function, the most important part: each perceptron has an associated threshold, and if the sum of all weighted inputs is bigger than this value the neuron produces a result (1); otherwise it does not (0). This result is the output of the neuron and is used either as the final outcome or to feed another perceptron.
At the end of each cycle, the algorithm analyzes the performance of each perceptron and, based on a parameter called the “learning rate”, modifies the weights of each neuron to adjust it and improve its performance.
In the equation above, z represents the output of the net input function and x represents the input values of the perceptron: each input is multiplied by its weight and all the results are added, i.e. z = wᵀx = w₁x₁ + w₂x₂ + … + wₙxₙ. This operation is done with matrices, where wᵀ is the transposed weight vector multiplied by the vector containing all the inputs.
Now let’s talk about how the perceptron decides whether or not to activate. The activation function is a mathematical expression with a threshold that governs activation: if the sum of all the products is bigger than this threshold, the perceptron produces a 1; otherwise it produces a 0. We will see later on that different activation functions have their own advantages when we use gradient descent, because of the mathematical properties of their derivatives.
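To make this tangible, here is a tiny sketch of a single perceptron’s forward pass with made-up weights and a simple step activation (purely illustrative values):

import numpy as np

def perceptron(x, w, b, threshold=0.0):
    # net input: weighted sum of the inputs plus a bias
    z = np.dot(w, x) + b
    # step activation: fire (1) if the net input passes the threshold, else 0
    return 1 if z > threshold else 0

x = np.array([0.5, 0.3, 0.9])    # three inputs (features)
w = np.array([0.4, -0.2, 0.7])   # one weight per input, adjusted during training
print(perceptron(x, w, b=-0.5))  # prints 1 or 0 depending on the weighted sum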
Now that we have understood how the perceptron works, let’s see how to use a set of them to create complex structures and obtain hidden features from the data using a Multilayer Perceptron model.
MLP
The idea behind the multilayer perceptron model is easy to understand once we have understood the perceptron concept. An MLP is a concatenation of different groups (layers) of perceptrons, where each perceptron in one layer is connected to all of the perceptrons in the next layer.
Let’s get our hands dirty and do some coding:
The first thing we need to do, apart from importing the data into our local environment, is to change the labels of the target variable. In our previous models we were able to use [1, 2 and 3]; unfortunately, the MLP model asks us to change them to the range [0, 1 and 2]. This is extremely simple:
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# change the y class to [0-2]
a = hp_oHe.HP_Forbidden_clean.replace([1, 2, 3], [0, 1, 2])

# create the final dataframe
df = pd.DataFrame(a)
hp2 = hp_oHe.drop('HP_Forbidden_clean', axis=1)
hp_oHe = pd.concat([hp2, df], axis=1)
Now let’s begin with the first step mentioned above: defining the model.
# define model
model1 = Sequential()
model1.add(Dense(10, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
model1.add(Dense(8, activation='relu', kernel_initializer='he_normal'))
model1.add(Dense(3, activation='softmax'))
Let’s analyze it line by line:
First, we create a model object called model1 using the Sequential class.
Then we add our input layer with the .add() method. This layer is dense (fully connected), it has 10 neurons, we use the Rectified Linear Unit (ReLU) activation function, and the input shape is the number of features we have. In this specific case n_features = 27, since after one-hot encoding we are working with 27 different features.
Next, we have another dense layer with 8 neurons and a ReLU activation function too. Layers sitting between the input and the output are called “hidden layers”.
Finally, we have the output layer. This one is extremely important, as its activation function determines the kind of task: a sigmoid would be used for binary classification (and a linear activation for regression), but we are pursuing a multiclass classification, so we use softmax. The number 3 indicates the number of classes; in this case our classes are [0, 1 and 2].
#compile the model
model1.compile(optimizer='sgd', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
Now the compilation. The first parameter is the optimizer, here Stochastic Gradient Descent, a mathematical algorithm that looks for the minimum of the error function using derivatives, searching for a local minimum. Our loss, or error, will be calculated using sparse categorical cross-entropy. We can modify these parameters later to tune our model, and finally, as we have been doing so far, we will use accuracy as our metric.
Now let’s fit our model
# fit the model
history = model1.fit(X_train_os, y_train_os, epochs=150, batch_size=32, verbose=2,validation_data=(X_test, y_test))
We store the output in a variable called history so we can plot some results later on. As you can see, we gave the function our training data, and we also asked it to train over the data 150 times with the epochs parameter. The batch size, for now, is the default of 32, but we will need to tune it later on. Another important parameter is verbose: if it is equal to 0, no messages are displayed during training; I set it to 2 because I want to see the progress of each epoch. Finally, I am giving the model some validation data.
Please visit my Google Colab notebook to take a look at the whole process and to see how I treated the dataset:
Now let’s train the model and see how it behaves:
These are the last epochs of training. As you can see, the model has a training accuracy of approximately 62% and a final accuracy on the test data of about 61%. The model is behaving quite poorly, but it is also extremely simple, so now let’s do some tuning.
Different types of gradient descent
As we stated before, Gradient Descent is a technique used to calibrate the weights of a neural network. The main idea relies on finding the local minimum of the error function using the derivatives of that function. In other words, this method focuses on finding the parameters that will produce the smallest amount of error. To achieve this objective, TensorFlow is capable of calculating the error gradient with different quantities of training samples. In other words, we can choose how many rows of data will be analyzed to determine the gradient and update the weights of our model.
There are three variants, depending on the number of training samples used for each weight update (see the small sketch after this list):
Batch Gradient Descent. The batch size is set to the total number of examples in the training dataset.
Stochastic Gradient Descent. Batch size is set to one.
Minibatch Gradient Descent. Batch size is set to more than one and less than the total number of examples in the training dataset.
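In Keras, the three variants only differ in the batch_size argument passed to fit(); a rough sketch, reusing model1 and the training data from earlier (in practice you would rebuild the model before each run so the results are comparable):

# Stochastic Gradient Descent: update the weights after every single sample
model1.fit(X_train_os, y_train_os, epochs=10, batch_size=1)

# Batch Gradient Descent: one update per epoch, using the whole training set
model1.fit(X_train_os, y_train_os, epochs=10, batch_size=len(X_train_os))

# Minibatch Gradient Descent: anything in between (32 is the Keras default)
model1.fit(X_train_os, y_train_os, epochs=10, batch_size=32)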
Pretty simple, isn’t it? The code we have been using so far is a perfect base to wrap in a function and tune this hyperparameter. Since today I learned how to use a class, let’s create a class with a nice name and a function, and try out the three methods to compare the model’s metrics:
This way of pasting code is extremely pretty, and now we can see the code is very straightforward. We have created the same kind of neural network as before, with an input layer of 28 neurons, three hidden layers with 20, 20, and 15 neurons, and finally a 3-neuron output layer with a softmax activation function for the multiclass classification. This architecture has only one small modification: it adds the configuration of two new parameters, the learning rate for gradient descent and the momentum. Everything is compiled together in line 15 and the parameters are set in lines 17 and 18. Finally, we create a plot showing how the train and test accuracy evolve over time.
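In case the embedded code does not render for you, here is a rough sketch of what that class looks like. The class and method names (BatchTuner, evaluate) are my own reconstruction from the description above, not the verbatim notebook code:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
import matplotlib.pyplot as plt

class BatchTuner:
    """Hypothetical reconstruction of the tuning class described above."""

    def evaluate(self, X_train, y_train, X_test, y_test,
                 batch_size=32, learning_rate=0.01, momentum=0.0, epochs=100):
        # same architecture as described: 28 -> 20 -> 20 -> 15 -> 3 (softmax)
        model = Sequential()
        model.add(Dense(28, activation='relu', kernel_initializer='he_normal',
                        input_shape=(X_train.shape[1],)))
        model.add(Dense(20, activation='relu', kernel_initializer='he_normal'))
        model.add(Dense(20, activation='relu', kernel_initializer='he_normal'))
        model.add(Dense(15, activation='relu', kernel_initializer='he_normal'))
        model.add(Dense(3, activation='softmax'))

        # SGD with configurable learning rate and momentum
        opt = SGD(learning_rate=learning_rate, momentum=momentum)
        model.compile(optimizer=opt, loss='sparse_categorical_crossentropy',
                      metrics=['accuracy'])

        history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size,
                            verbose=0, validation_data=(X_test, y_test))

        # plot the evolution of train vs. test accuracy over the epochs
        plt.plot(history.history['accuracy'], label='train')
        plt.plot(history.history['val_accuracy'], label='test')
        plt.legend()
        plt.show()
        return history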
The first parameter we will be tuning is the batch size, or in other words, how many training samples are considered to estimate the error of the model. By default the batch size is set to 32, but here we will try an array of possibilities. This is the code we will be using:
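(The embedded gist holds the real script; a minimal equivalent sketch using the hypothetical BatchTuner class from above, with an illustrative list of candidate sizes, would look like this.)

tuner = BatchTuner()
for bs in [1, 32, 64, 128, 256]:
    # one run per candidate batch size, everything else held constant
    tuner.evaluate(X_train_os, y_train_os, X_test, y_test, batch_size=bs, epochs=100)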
And here are our results:
As we can see, we used stochastic and minibatch gradient descent. In this specific case, with stochastic GD the training accuracy does not improve at all after 100 epochs, while the test accuracy fluctuates rapidly and never gets above 60%. Starting from a batch size of 32 the behavior improves drastically, and we can also see that the more we keep increasing the size, the worse the model behaves, which puts the sweet spot at 32. Now that we know this parameter, let’s tune the next one, the learning rate. But first, let’s see what it is.
Learning Rate
As we mentioned before, the purpose of using gradient descent is to find the minimum value of a function. In our specific case, we are modeling a mathematical function that represents the amount of error the algorithm has based on our data. Let’s imagine that the error function can be represented as a parabolic function in a 2D space. We will start looking for the minimum value in a random position, and we will be moving to the left or to the right until we find a local or global minimum. The learning rate is the size of the step the function takes to the left or to the right.
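In symbols, each step of gradient descent updates the weights in the direction opposite to the gradient of the loss L, scaled by the learning rate η (this is the standard textbook update rule, nothing specific to our notebook):

w ← w − η · ∇L(w)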
This is an example of a 3D function and the learning rate.
We need to be careful when tuning the learning rate because there is no universally good value. In some cases the function might need to travel a long way to find the minimum, and if the LR is too small it will never converge. Conversely, if the LR is too big, it will keep bouncing around and never settle into the local or global minimum.
Now let’s code the testing script and plot the results. I will fix the batch size to 32, and as this parameter allows my code to run faster, I will increase the number of epochs to 200.
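A rough sketch of that script, again reusing the hypothetical BatchTuner class from above (the exact list of rates is my assumption; the notebook has the real one):

tuner = BatchTuner()
for lr in [1.0, 0.1, 0.01, 0.001]:
    # batch size fixed at 32, 200 epochs, only the learning rate changes
    tuner.evaluate(X_train_os, y_train_os, X_test, y_test,
                   batch_size=32, learning_rate=lr, epochs=200)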
Here we can see something quite interesting: the optimization does not behave correctly with a big learning rate. In the graphs we can see that the model fluctuates a lot when the LR is between 1 and 0.1. We find our minimum best when the learning rate is 0.01, with an accuracy of 82% on the test dataset.
Momentum:
The mathematical concept of momentum is quite involved and I will not explain it in depth, but I do recommend reading the following post if you are interested in the idea it builds on, the moving average.
In our specific case, we are not calculating the exact derivative of the whole loss function over all the data. Instead, we use a small batch of examples to perform this task, and because of that sampling the gradient estimate might not be completely accurate; this is where a moving average comes in handy. Momentum allows our updates to better approximate the true derivative. To tune this parameter, we will test an array of values in the range 0 to 1:
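A minimal sketch of that sweep, once more with the hypothetical BatchTuner class (the candidate values are an assumption based on the 0 to 1 range mentioned above):

tuner = BatchTuner()
for m in [0.0, 0.5, 0.9, 0.99]:
    # learning rate fixed at 0.01 (the best value found above), only momentum varies
    tuner.evaluate(X_train_os, y_train_os, X_test, y_test,
                   batch_size=32, learning_rate=0.01, momentum=m, epochs=200)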
As we can see, momentum helps our test accuracy fluctuate less. We conclude that the optimal value is 0.9, where we obtain an accuracy of 81.4%.
Thank you so much for reading! In our next post we will analyze other methods of optimizing our MLP model, and in a future post we will use Google Cloud to store our model and make predictions in the cloud. Stay tuned for more.