Day 43 (DL): Implementation of a Binary Classifier Using Simple Neural Networks

Let’s jump right into implementing a fully connected neural network (FNN) to classify the iris dataset. Since we are building a binary classifier, we use only two of the three target classes. We will cover both approaches: Python from scratch, and Keras from TensorFlow.

Table of contents:

  • Parameter Initialization
  • Forward propagation
  • Cost function
  • Backward pass
  • Prediction

Implementation of ANN from scratch using Python & NumPy: For our experiment, we take a 4-layer NN: an input layer, two hidden layers and an output layer. The number of nodes (neurons) in each layer is as follows:

Input layer (first) = 4 (matching the number of features in the input dataset)

Second layer (hidden) = 5

Third layer (hidden) = 3

Final layer (output) = 1, since it is a binary classification

Fig 1: The NN chosen for the use case

Note that the input layer is just a representation of the values fed in. It has neither a linear function nor an activation function. The actual computation starts from the first hidden layer. The ‘relu’ activation is used for the intermediate layers, whereas the final output layer uses ‘sigmoid’.

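Parameter Initialization: As in classical machine learning, the learnable parameters in DL are the weights and biases. The weights are initialized to small random values: if every weight started at the same value, all neurons in a layer would compute the same output and receive the same gradient, so they could never learn different features. The biases can start at zero. From this starting point, gradient descent moves the parameters step by step towards a minimum of the cost.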
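A minimal sketch of such an initialization is given here. The function name `initialize_parameters`, the seed and the 0.01 scale are assumptions made for illustration, so the exact numbers will differ from the output shown in this article. Each weight matrix has the shape (nodes in previous layer, nodes in current layer) and each bias has the shape (1, nodes in current layer).

```python
import numpy as np

def initialize_parameters(layer_dims, seed=3, scale=0.01):
    """Return a dict of small random weights and zero biases for each layer.

    layer_dims: e.g. [4, 5, 3, 1] (input, hidden, hidden, output).
    seed and scale are illustrative assumptions, not values from the article.
    """
    rng = np.random.default_rng(seed)
    parameters = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_curr = layer_dims[l - 1], layer_dims[l]
        # Small random weights break the symmetry between neurons.
        parameters[f"W{l}"] = rng.standard_normal((n_prev, n_curr)) * scale
        # Biases can safely start at zero.
        parameters[f"b{l}"] = np.zeros((1, n_curr))
    return parameters

parameters = initialize_parameters([4, 5, 3, 1])
for name, value in parameters.items():
    print(name, value.shape)
```

Indexing the dictionary keys from 1 keeps ‘W1’ and ‘b1’ attached to the first hidden layer, which is the first layer that actually owns parameters.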

The parameters initialised for layer_dims = [4,5,3,1] look like the output below. Each value is stored in a dictionary as a {key: value} pair. As 3 layers contain internal parameters (weights & bias), the output has three corresponding weight/bias pairs.

{'W1': array([[ 0.01788628, 0.0043651 , 0.00096497, -0.01863493, -0.00277388], [-0.00354759, -0.00082741, -0.00627001, -0.00043818, -0.00477218], [-0.01313865, 0.00884622, 0.00881318, 0.01709573, 0.00050034], [-0.00404677, -0.0054536 , -0.01546477, 0.00982367, -0.01101068]]), shape = 4 x 5

'W2': array([[-0.01185047, -0.0020565 , 0.01486148], [ 0.00236716, -0.01023785, -0.00712993], [ 0.00625245, -0.00160513, -0.00768836], [-0.00230031, 0.00745056, 0.01976111], [-0.01244123, -0.00626417, -0.00803766]]), shape = 5 x 3

'W3': array([[-0.02419083], [-0.00923792], [-0.01023876]]), shape = 3 x 1

'b1': array([[0., 0., 0., 0., 0.]]), shape = 1 x 5

'b2': array([[0., 0., 0.]]), shape = 1 x 3

'b3': array([[0.]]), shape = 1 x 1}

The notation or ordering may vary in someone else’s code; for instance, the shape of ‘W1’ can be 5 x 4. In that case, the order of the multiplication between the input samples and the weights has to change accordingly (or the weights have to be transposed). This will become clearer in the forward propagation.

Forward Propagation: In the forward pass, each neuron (node) applies a linear function followed by an activation. The hidden layers use ‘relu’ as the activation, whereas the final/output layer uses ‘sigmoid’.

  • Since the first layer only holds the input, we start with index ‘1’, which corresponds to the first hidden layer. Using the initialised weights and bias, z = X·w + b is computed (X represents the inputs; with the 4 x 5 ‘W1’ above, X holds one sample per row). This is followed by the ‘relu’ activation, i.e. taking the maximum of (0, z). If z is negative, zero is passed on.
  • For the last layer, ‘z’ is calculated in the same way, but the activation is the sigmoid, 1 / (1 + e^(-z)). All the intermediate values are stored in a variable called caches, which is used later for backpropagation. Note: in Python (arrays/lists/data frames), all indexes start at zero.
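The two steps above can be sketched as follows. This is an illustration under assumptions, not the original code: the function names, the random stand-in input `X` and the choice to cache each layer's input together with its ‘z’ are ours. Weights use the (previous nodes, current nodes) shape, so the linear step is written as `A_prev @ W + b`.

```python
import numpy as np

def relu(z):
    # Element-wise maximum of (0, z).
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(X, parameters):
    """X has shape (samples, features). Returns the output and the caches."""
    num_layers = len(parameters) // 2  # one W and one b per layer
    A = X
    caches = []
    for l in range(1, num_layers + 1):
        A_prev = A
        Z = A_prev @ parameters[f"W{l}"] + parameters[f"b{l}"]
        # Sigmoid on the last layer, relu on the hidden layers.
        A = sigmoid(Z) if l == num_layers else relu(Z)
        # caches[0] belongs to layer 1 because Python indexes start at zero.
        caches.append((A_prev, Z))
    return A, caches

# Stand-in parameters and inputs (random values, not the iris data).
rng = np.random.default_rng(0)
layer_dims = [4, 5, 3, 1]
parameters = {}
for l in range(1, len(layer_dims)):
    parameters[f"W{l}"] = rng.standard_normal((layer_dims[l - 1], layer_dims[l])) * 0.01
    parameters[f"b{l}"] = np.zeros((1, layer_dims[l]))

X = rng.random((6, 4))
y_hat, caches = forward_propagation(X, parameters)
print(y_hat.shape)  # (6, 1)
```

If the weights were stored as 5 x 4 instead, the linear step would become `W @ A_prev` with one sample per column, which is the change of order mentioned earlier.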

Cost function: As the task is binary classification, we use binary cross-entropy as the loss function. This is the same loss that is used for logistic regression in classical machine learning.

The cost is the sum of the per-sample losses divided by the number of training samples.
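As a small sketch, the cost can be written as -(1/m) * sum(y * log(y_hat) + (1 - y) * log(1 - y_hat)), where m is the number of samples. The function name `compute_cost` and the clipping constant `eps` are assumptions added here; the clipping only guards against log(0).

```python
import numpy as np

def compute_cost(y_hat, y, eps=1e-12):
    """Binary cross-entropy averaged over the samples.

    y_hat: predicted probabilities, shape (m, 1)
    y:     true labels (0 or 1), shape (m, 1)
    """
    # Keep predictions strictly inside (0, 1) so the logs stay finite.
    y_hat = np.clip(y_hat, eps, 1 - eps)
    losses = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return float(np.sum(losses) / y.shape[0])

y = np.array([[1.0], [0.0], [1.0], [0.0]])
y_hat = np.array([[0.9], [0.1], [0.8], [0.3]])
print(compute_cost(y_hat, y))
```

A model that outputs 0.5 for every sample gets a cost of ln(2), roughly 0.693, which is a handy baseline when reading the training cost.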

Backward pass: As with any ML objective, we update the weights and biases (the learnable parameters) to obtain a well-performing model. The pass starts at the last layer and propagates back through the previous layers until it reaches the first hidden layer, the one fed by the input layer. At each layer we compute two kinds of gradients: one with respect to the current layer's parameters (used for the update) and one with respect to the layer's input (passed back to the previous layer). Since backpropagation supplies the first-order derivatives used by gradient descent, the result depends on the activation function of each layer and on the corresponding parameters (weights & bias).

We need to pass layer_dims, caches, parameters, the actual output (train_y) and the learning_rate, a hyperparameter that controls the size of each update step and hence how quickly the model approaches the minimum.

We already know that the derivative of the loss function combined with the sigmoid function, taken with respect to ‘z’, is (predicted - actual). In the set of instructions below, ‘dzprev’ is the derivative backpropagated from the loss function. Since z = wx + b, the derivative of z w.r.t. ‘w’ is ‘x’, and w.r.t. ‘b’ it is 1. Applying the chain rule, we calculate ‘dwlast’, and ‘dblast’ follows the same approach. Once these are computed, we can update the last layer's weights and bias accordingly.
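A hedged sketch of this last-layer step is shown here. The names ‘dzprev’, ‘dwlast’ and ‘dblast’ come from the text; the function name `update_last_layer`, its arguments and the tiny hand-made arrays are assumptions for illustration. Gradients are averaged over the m samples to match the cost.

```python
import numpy as np

def update_last_layer(a_prev, predicted, actual, w_last, b_last, learning_rate):
    """One gradient-descent step for the sigmoid output layer.

    a_prev:    activations feeding the output layer, shape (m, n_prev)
    predicted: sigmoid outputs, shape (m, 1)
    actual:    true labels, shape (m, 1)
    """
    m = actual.shape[0]
    dzprev = predicted - actual                          # d(loss)/dz for sigmoid + cross-entropy
    dwlast = a_prev.T @ dzprev / m                       # chain rule: dz/dw is the layer input
    dblast = np.sum(dzprev, axis=0, keepdims=True) / m   # chain rule: dz/db is 1
    # Gradient handed to the previous layer, using the weights before the update.
    da_prev = dzprev @ w_last.T
    w_new = w_last - learning_rate * dwlast
    b_new = b_last - learning_rate * dblast
    return w_new, b_new, da_prev

a_prev = np.array([[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]])
predicted = np.array([[0.8], [0.4]])
actual = np.array([[1.0], [0.0]])
w_last = np.zeros((3, 1))
b_last = np.zeros((1, 1))

w_new, b_new, da_prev = update_last_layer(
    a_prev, predicted, actual, w_last, b_last, learning_rate=0.1
)
print(w_new.ravel(), b_new.ravel())
```

Computing `da_prev` before overwriting the weights matters: the gradient passed backwards must come from the same weights that produced the forward pass.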

The next step is to update the weights and biases of the hidden layers. For ‘relu’, the derivative is ‘1’ for positive values and ‘0’ otherwise. Again applying the chain rule, we calculate the derivatives to pass to the previous layers. Please refer to the recommended video for a clear understanding of nested backpropagation. We always use the chain rule to link each updatable parameter to the loss function, as the aim is to find the optimized weights and biases.
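The hidden-layer step can be sketched like this. The function name `backprop_hidden_layer` and all variable names are hypothetical; the example numbers are made up so the effect of the relu mask is easy to follow by hand.

```python
import numpy as np

def backprop_hidden_layer(dz_next, w_next, z, a_prev):
    """Gradients for one relu hidden layer.

    dz_next: gradient w.r.t. z of the following layer, shape (m, n_next)
    w_next:  weights of the following layer, shape (n_curr, n_next)
    z:       this layer's pre-activation values, shape (m, n_curr)
    a_prev:  this layer's input, shape (m, n_prev)
    """
    m = a_prev.shape[0]
    da = dz_next @ w_next.T      # gradient w.r.t. this layer's activation
    dz = da * (z > 0)            # relu derivative: 1 where z > 0, else 0
    dw = a_prev.T @ dz / m       # chain rule through z = a_prev @ w + b
    db = np.sum(dz, axis=0, keepdims=True) / m
    return dz, dw, db

dz_next = np.array([[0.5], [-1.0]])
w_next = np.array([[2.0], [3.0]])
z = np.array([[1.0, -1.0], [-0.5, 2.0]])
a_prev = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

dz, dw, db = backprop_hidden_layer(dz_next, w_next, z, a_prev)
print(dz)
print(dw)
print(db)
```

The returned `dz` plays the role of `dz_next` for the layer before it, so calling this function layer by layer from back to front gives the nested backpropagation described above.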

Each time backpropagation runs, the overall cost should go down (given a suitably small learning rate), which indicates that we are moving in the right direction. If the cost keeps rising instead, then the derivatives were probably evaluated incorrectly or the learning rate is too large.
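One complementary way to test the derivatives is a numerical gradient check, which is not part of the original walkthrough: each parameter is nudged by a tiny amount and the resulting change in cost is compared with the backpropagated gradient. The sketch below is self-contained and uses assumed names, random stand-in data and small non-zero biases (so that no ‘z’ sits exactly at the relu kink).

```python
import numpy as np

def forward(X, params):
    num_layers = len(params) // 2
    A, caches = X, []
    for l in range(1, num_layers + 1):
        Z = A @ params[f"W{l}"] + params[f"b{l}"]
        caches.append((A, Z))  # layer input and pre-activation
        A = 1.0 / (1.0 + np.exp(-Z)) if l == num_layers else np.maximum(0, Z)
    return A, caches

def cost(X, y, params):
    y_hat, _ = forward(X, params)
    return float(-np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

def gradients(X, y, params):
    """Backpropagated gradients of the cost for every W and b."""
    m = y.shape[0]
    num_layers = len(params) // 2
    y_hat, caches = forward(X, params)
    grads = {}
    dZ = y_hat - y  # sigmoid + cross-entropy
    for l in range(num_layers, 0, -1):
        A_prev, _ = caches[l - 1]
        grads[f"W{l}"] = A_prev.T @ dZ / m
        grads[f"b{l}"] = np.sum(dZ, axis=0, keepdims=True) / m
        if l > 1:
            Z_prev = caches[l - 2][1]
            dZ = (dZ @ params[f"W{l}"].T) * (Z_prev > 0)  # relu derivative
    return grads

def max_gradient_error(X, y, params, eps=1e-5):
    """Largest gap between backprop and central-difference gradients."""
    analytic = gradients(X, y, params)
    worst = 0.0
    for name, value in params.items():
        for idx in np.ndindex(value.shape):
            original = value[idx]
            value[idx] = original + eps
            cost_plus = cost(X, y, params)
            value[idx] = original - eps
            cost_minus = cost(X, y, params)
            value[idx] = original  # restore the parameter
            numeric = (cost_plus - cost_minus) / (2 * eps)
            worst = max(worst, abs(numeric - analytic[name][idx]))
    return worst

rng = np.random.default_rng(0)
layer_dims = [4, 5, 3, 1]
params = {}
for l in range(1, len(layer_dims)):
    params[f"W{l}"] = rng.standard_normal((layer_dims[l - 1], layer_dims[l])) * 0.5
    params[f"b{l}"] = rng.standard_normal((1, layer_dims[l])) * 0.1
X = rng.random((8, 4))
y = rng.integers(0, 2, size=(8, 1)).astype(float)

print(max_gradient_error(X, y, params))
```

A very small value means the two gradients agree; a large gap points at a mistake in one of the derivative formulas rather than at the learning rate.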

Prediction: After training, the last updated weights and biases are saved and passed as parameters to the model. Running forward propagation with these latest weights and biases predicts the output for the test data.
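A minimal sketch of the prediction step follows. The 0.5 threshold, the function names and the random stand-in parameters and test data are assumptions; in practice `parameters` would be the trained values and `test_X` the held-out iris samples.

```python
import numpy as np

def forward(X, parameters):
    """Forward pass: relu on hidden layers, sigmoid on the output layer."""
    num_layers = len(parameters) // 2
    A = X
    for l in range(1, num_layers + 1):
        Z = A @ parameters[f"W{l}"] + parameters[f"b{l}"]
        A = 1.0 / (1.0 + np.exp(-Z)) if l == num_layers else np.maximum(0, Z)
    return A

def predict(X, parameters, threshold=0.5):
    """Turn the predicted probabilities into class labels 0 or 1."""
    return (forward(X, parameters) >= threshold).astype(int)

def accuracy(predictions, y):
    return float(np.mean(predictions == y))

# Stand-in "trained" parameters and test data (random, for illustration only).
rng = np.random.default_rng(1)
layer_dims = [4, 5, 3, 1]
parameters = {}
for l in range(1, len(layer_dims)):
    parameters[f"W{l}"] = rng.standard_normal((layer_dims[l - 1], layer_dims[l]))
    parameters[f"b{l}"] = np.zeros((1, layer_dims[l]))
test_X = rng.random((5, 4))
test_y = rng.integers(0, 2, size=(5, 1))

predictions = predict(test_X, parameters)
print(predictions.ravel(), accuracy(predictions, test_y))
```

No caches are needed here because prediction never runs the backward pass.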

The same logic can be implemented using Keras from TensorFlow. For deep learning we can use either PyTorch or TensorFlow (Keras). Both handle tensors (the generalization of matrices to higher dimensions). Note: when using Google Colab for deep learning, we can turn on the ‘GPU’ under Runtime -> Change runtime type.

The first statement, Sequential, defines a model in which the layers are added consecutively, each feeding the next; the Dense layers are what make the network fully connected.
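A minimal Keras sketch of the same 4-5-3-1 network could look like the following. The optimizer choice (‘sgd’), the accuracy metric and the commented-out `fit` call with `train_X` and `train_y` are assumptions, since the original Keras code is not reproduced here.

```python
import tensorflow as tf

# Layers are stacked in order: 4 inputs -> 5 (relu) -> 3 (relu) -> 1 (sigmoid).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(5, activation="relu"),
    tf.keras.layers.Dense(3, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Binary cross-entropy matches the cost function used in the NumPy version.
model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()

# Training would then look like this (train_X and train_y are assumed to exist):
# model.fit(train_X, train_y, epochs=100, verbose=0)
```

Keras takes care of parameter initialization, the forward pass, the backward pass and the updates, so the whole from-scratch pipeline collapses into `compile` and `fit`.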

The entire code can be found in the GitHub repository.

Recommended Video:
