# Day 43 (DL) — Implementation of a Binary Classifier using Simple Neural Networks

Let’s jump right into the implementation of an FNN (fully connected neural network) for classifying the iris dataset. Since we are creating a binary classifier, we consider only two of the three target classes. We will cover both approaches, i.e. Python from scratch as well as Keras from TensorFlow.
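As a sketch of that two-class setup (assuming scikit-learn is available; the variable names `train_x`, `train_y`, `test_x`, `test_y` are the ones used throughout the rest of the post, and the split parameters here are illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# load iris and keep only the first two classes (labels 0 and 1)
iris = load_iris()
mask = iris.target < 2
X, y = iris.data[mask], iris.target[mask].reshape(-1, 1)

# split into train/test sets; each row is one sample with 4 features
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=3)
```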

*Table of contents:*

- Parameter Initialization
- Forward propagation
- Cost function
- Backward pass
- Prediction

**Implementation of an ANN from scratch using Python & NumPy:** For our experiment, we take a 4-layered NN with two hidden layers (the remaining two being the input and output layers). The number of nodes (neurons) per layer is as follows:

- Input layer (first) = 4, matching the number of features in the input dataset
- Second layer (hidden) = 5
- Third layer (hidden) = 3
- Final layer (output) = 1, since it is a binary classification

Notice that the input layer is just a representation of the values fed in; it has no linear transformation or activation function. The actual computation starts from the first hidden layer. The ‘relu’ activation is used for the intermediary layers, whereas the final output layer uses ‘sigmoid’.
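The two activations can be written as NumPy one-liners (a minimal sketch; the from-scratch code below inlines these same formulas rather than calling helper functions):

```python
import numpy as np

def relu(z):
    # pass positive values through unchanged, clamp negatives to zero
    return np.maximum(0, z)

def sigmoid(z):
    # squash any real value into the (0, 1) range
    return 1 / (1 + np.exp(-z))
```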

**Parameter Initialization:** As in machine learning, the learnable parameters in DL are the weights and biases. We initialize the weights to small random values so that the model takes appropriate gradient steps towards the minimum (low cost).

```python
# initializing the weights for each layer
def initialize_weights(layer_dims):
    np.random.seed(3)
    parameters = {}
    L = len(layer_dims)
    # initialise the weights and biases based on the number of layers
    for i in range(1, L):
        parameters['W' + str(i)] = np.random.randn(layer_dims[i-1], layer_dims[i]) * 0.01
        parameters['b' + str(i)] = np.zeros([1, layer_dims[i]])
    return parameters
```

The parameters initialised for layer_dims = [4, 5, 3, 1] look like the output below. Each value is stored in a dictionary as a {key: value} pair. As there are 3 layers carrying learnable parameters (weights & biases), we get three corresponding pairs of entries.

{'W1': array([[ 0.01788628, 0.0043651 , 0.00096497, -0.01863493, -0.00277388], [-0.00354759, -0.00082741, -0.00627001, -0.00043818, -0.00477218], [-0.01313865, 0.00884622, 0.00881318, 0.01709573, 0.00050034], [-0.00404677, -0.0054536 , -0.01546477, 0.00982367, -0.01101068]]), **shape = 4 x 5**

'W2': array([[-0.01185047, -0.0020565 , 0.01486148], [ 0.00236716, -0.01023785, -0.00712993], [ 0.00625245, -0.00160513, -0.00768836], [-0.00230031, 0.00745056, 0.01976111], [-0.01244123, -0.00626417, -0.00803766]]), **shape = 5 x 3**

'W3': array([[-0.02419083], [-0.00923792], [-0.01023876]]), **shape = 3 x 1**

'b1': array([[0., 0., 0., 0., 0.]]), **shape = 1 x 5**

'b2': array([[0., 0., 0.]]), **shape = 1 x 3**

'b3': array([[0.]]), **shape = 1 x 1**}

Sometimes the notation or order may vary in other implementations; for instance, the shape of ‘W1’ can be 5 x 4. In that case, when the input training samples are multiplied with the weights, the order of the matrix multiplication has to change accordingly. This will become clearer in the forward propagation.
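A quick shape check illustrates the point: with W1 stored as 4 x 5, the batch of samples multiplies on the left; with the transposed 5 x 4 convention the order flips, yet the result is the same (illustrative shapes only):

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(10, 4)        # 10 samples, 4 features
W1 = np.random.randn(4, 5)        # the 4 x 5 convention used in this post
b1 = np.zeros((1, 5))

Z_a = np.dot(X, W1) + b1          # (10, 4) @ (4, 5) -> (10, 5)

W1_alt = W1.T                     # the 5 x 4 convention seen elsewhere
Z_b = np.dot(W1_alt, X.T).T + b1  # (5, 4) @ (4, 10) -> (5, 10), transposed back

# both orderings yield the same pre-activation values
assert np.allclose(Z_a, Z_b)
```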

**Forward Propagation:** In the forward pass, we apply the linear function followed by the activation for each neuron (node). The hidden layers use ‘relu’ as the activation, whereas the final/output layer uses ‘sigmoid’.

```python
# forward propagation
def forward_propagation(layer_dims, train_x, parameters):
    caches = []
    Aprev = train_x
    L = len(layer_dims)
    # forward propagation for all the layers except the last layer
    for i in range(1, L - 1):
        W = parameters['W' + str(i)]
        b = parameters['b' + str(i)]
        Z = np.dot(Aprev, W) + b
        Aprev = np.maximum(0, Z)
        cache = Aprev, W, b
        caches.append(cache)
    # forward propagation for the last layer
    W = parameters['W' + str(L - 1)]
    b = parameters['b' + str(L - 1)]
    Zlast = np.dot(Aprev, W) + b
    Alast = 1 / (1 + np.exp(-Zlast))
    cache = Alast, W, b
    caches.append(cache)
    return caches
```

- Since the first layer only carries the input, the loop starts at ‘1’, corresponding to the first hidden layer. Using the initialised weights and biases, z = X·W + b is computed (X represents the inputs). This is followed by the ‘relu’ activation, i.e. taking the maximum of (0, z): if z is negative, zero is passed on.
- For the last layer, the calculation of ‘z’ is the same as above, but the activation is the sigmoid, 1 / (1 + e^-z). All intermediate values are stored in a variable called caches, to be reused during backpropagation.

*Note: in Python (arrays/lists/data frames), all indexes start at zero.*
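As a sanity check on the shapes, the same linear + activation steps can be traced standalone with random parameters (a sketch, independent of the function above):

```python
import numpy as np

np.random.seed(3)
layer_dims = [4, 5, 3, 1]
X = np.random.randn(8, 4)  # 8 dummy samples with 4 features

A = X
for i in range(1, len(layer_dims)):
    W = np.random.randn(layer_dims[i-1], layer_dims[i]) * 0.01
    b = np.zeros((1, layer_dims[i]))
    Z = np.dot(A, W) + b
    if i < len(layer_dims) - 1:
        A = np.maximum(0, Z)          # relu for the hidden layers
    else:
        A = 1 / (1 + np.exp(-Z))      # sigmoid for the output layer

# one probability per sample, each strictly between 0 and 1
print(A.shape)  # -> (8, 1)
```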

**Cost function:** As the task falls under binary classification, we incorporate binary cross-entropy as the loss function. This follows the exact same rule as in machine learning.

```python
def cost_calculate(predict_y, train_y):
    m = train_y.shape[0]
    cost = -(np.dot(train_y.T, np.log(predict_y))
             + np.dot((1 - train_y).T, np.log(1 - predict_y))) / m
    return cost
```

The cost is the sum of the per-sample losses, divided by the number of training samples.
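A tiny worked example using the same vectorized formula: a confident correct prediction contributes a small loss, a confidently wrong one a large loss (the probabilities here are made up):

```python
import numpy as np

def cost_calculate(predict_y, train_y):
    m = train_y.shape[0]
    return -(np.dot(train_y.T, np.log(predict_y))
             + np.dot((1 - train_y).T, np.log(1 - predict_y))) / m

y = np.array([[1.0], [0.0]])

good = np.array([[0.9], [0.1]])   # close to the true labels
bad = np.array([[0.1], [0.9]])    # confidently wrong

print(cost_calculate(good, y).item())  # ~0.105
print(cost_calculate(bad, y).item())   # ~2.303
```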

**Backward pass:** As with any ML objective, here too we update the weights and biases (the learnable parameters) to obtain an efficiently performing model. Backpropagation starts at the last layer and propagates through all the previous layers until it reaches the first hidden layer. At each layer we need two kinds of gradients: one for the current layer’s parameters and one to pass back to the previous layer. Since backpropagation takes first-order derivatives, each gradient depends on the layer’s activation function and its parameters (weights & biases).

```python
def backward_propagation(layer_dims, caches, parameters, train_x, train_y, learning_rate):
    # backward propagation for the last layer:
    # extract the last tuple from caches, which corresponds to the final output
    # (train_x is needed later for the first hidden layer's gradient)
    L = len(layer_dims)
    Acurr, Wcurr, bcurr = caches[L - 2]
    Aprev, Wprev, bprev = caches[L - 3]
    m = train_y.shape[0]
```

We need to pass in the layer_dims, caches, parameters, the actual output (train_y), and the learning_rate, a hyperparameter that controls the speed at which the model approaches the minimum.

We already know that the derivative of the sigmoid function combined with the cross-entropy loss is simply (predicted − actual). In the instructions below, ‘dzprev’ is this derivative backpropagated from the loss function. Since z = wx + b, taking the derivative with respect to ‘w’ gives ‘x’; applying the chain rule we obtain ‘dwlast’, and a similar approach gives ‘dblast’. Once these are computed, we update the last layer’s weights and biases accordingly.
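That (predicted − actual) shortcut can be verified numerically (a standalone check, not part of the training code): perturb z, recompute the loss, and compare the finite-difference slope with the analytic value:

```python
import numpy as np

y, z = 1.0, 0.3
sigmoid = lambda t: 1 / (1 + np.exp(-t))
loss = lambda t: -(y * np.log(sigmoid(t)) + (1 - y) * np.log(1 - sigmoid(t)))

analytic = sigmoid(z) - y                           # the (predicted - actual) shortcut
eps = 1e-6
numeric = (loss(z + eps) - loss(z - eps)) / (2 * eps)  # central difference

# the two slopes agree to high precision
assert abs(analytic - numeric) < 1e-6
```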

```python
    # (continued inside backward_propagation)
    dzprev = (Acurr - train_y)
    dwlast = np.dot(Aprev.T, dzprev) / m
    dblast = np.sum(dzprev, keepdims=True, axis=0) / m
    parameters['W' + str(L - 1)] = parameters['W' + str(L - 1)] - (learning_rate * dwlast)
    parameters['b' + str(L - 1)] = parameters['b' + str(L - 1)] - (learning_rate * dblast)
```

The next step is to update the weights and biases of the hidden layers. For ‘relu’ the derivative is ‘1’ for positive values and ‘0’ otherwise. Applying the chain rule again, we compute the derivatives to pass back to each previous layer. Please refer to the recommended video for a clearer understanding of the nested backpropagation. We always use the chain rule to link each updatable parameter to the loss function, since the aim is to determine the optimized weights and biases.
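In isolation, that relu derivative is just a mask of ones and zeros over the activations (a tiny sketch):

```python
import numpy as np

A = np.array([[-1.5, 0.0, 2.0]])
# slope is 1 where the activation is positive, 0 otherwise
dA = np.where(A > 0, 1, 0)
print(dA)  # -> [[0 0 1]]
```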

```python
    # (continued inside backward_propagation)
    for i in reversed(range(L - 2)):
        Anext, Wnext, bnext = caches[i + 1]
        Acurr, Wcurr, bcurr = caches[i]
        if i == 0:
            Aprev = train_x
        else:
            Aprev, Wprev, bprev = caches[i - 1]
        # relu derivative: 1 where the activation is positive, 0 otherwise
        dzcurr = np.where(Acurr > 0, 1, 0)
        dzprev = np.multiply(np.dot(dzprev, Wnext.T), dzcurr)
        dW = np.dot(Aprev.T, dzprev) / m
        db = np.sum(dzprev, keepdims=True, axis=0) / m
        parameters['W' + str(i + 1)] = parameters['W' + str(i + 1)] - (learning_rate * dW)
        parameters['b' + str(i + 1)] = parameters['b' + str(i + 1)] - (learning_rate * db)
    return parameters
```

Every time backpropagation runs, the overall cost should decrease; that is an indication we are moving in the right direction. If not, something is off in the way we evaluated the derivatives.
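The driver below calls a `complete_model` helper whose definition is not shown in this post. A minimal, self-contained sketch of what it presumably does (chaining initialization, forward pass, cost, and backward pass, here inlined and run on made-up two-class data; the structure is an assumption, not the author’s exact code):

```python
import numpy as np

def complete_model(layer_dims, train_x, train_y, learning_rate, iterations):
    # initialize small random weights and zero biases, as above
    np.random.seed(3)
    parameters = {}
    L = len(layer_dims)
    for i in range(1, L):
        parameters['W' + str(i)] = np.random.randn(layer_dims[i-1], layer_dims[i]) * 0.01
        parameters['b' + str(i)] = np.zeros([1, layer_dims[i]])
    m = train_x.shape[0]
    for it in range(iterations):
        # forward pass, caching each layer's activation
        A, activations = train_x, [train_x]
        for i in range(1, L - 1):
            A = np.maximum(0, np.dot(A, parameters['W' + str(i)]) + parameters['b' + str(i)])
            activations.append(A)
        Z = np.dot(A, parameters['W' + str(L-1)]) + parameters['b' + str(L-1)]
        Alast = 1 / (1 + np.exp(-Z))
        # backward pass using the (predicted - actual) shortcut
        dZ = Alast - train_y
        for i in reversed(range(1, L)):
            dW = np.dot(activations[i-1].T, dZ) / m
            db = np.sum(dZ, axis=0, keepdims=True) / m
            if i > 1:
                # propagate the gradient through the relu of the previous layer
                dZ = np.dot(dZ, parameters['W' + str(i)].T) * (activations[i-1] > 0)
            parameters['W' + str(i)] -= learning_rate * dW
            parameters['b' + str(i)] -= learning_rate * db
        if it % 1000 == 0:
            cost = -(np.dot(train_y.T, np.log(Alast))
                     + np.dot((1 - train_y).T, np.log(1 - Alast))) / m
            print('The cost after iteration %d: %s' % (it, cost.item()))
    return parameters

# synthetic two-class data just to exercise the sketch
np.random.seed(0)
train_x = np.vstack([np.random.randn(50, 4) - 1.0, np.random.randn(50, 4) + 1.0])
train_y = np.vstack([np.zeros((50, 1)), np.ones((50, 1))])
params = complete_model([4, 5, 3, 1], train_x, train_y, 0.15, 2000)
```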

```python
layer_dims = [4, 5, 3, 1]
learning_rate = 0.15
iterations = 14900
parameters = complete_model(layer_dims, train_x, train_y, learning_rate, iterations)
```

```
The cost after iteration 0: 0.6931458662427084
The cost after iteration 1000: 0.6928176749538959
The cost after iteration 2000: 0.6926375234627491
The cost after iteration 3000: 0.6905539990663362
The cost after iteration 4000: 0.6674492997222916
The cost after iteration 5000: 0.4943033781872236
The cost after iteration 6000: 0.22287462328522417
The cost after iteration 7000: 0.09010404330852087
The cost after iteration 8000: 0.05181617021726566
The cost after iteration 9000: 0.035345626279470674
The cost after iteration 10000: 0.02656350008290693
The cost after iteration 11000: 0.021191446335718244
The cost after iteration 12000: 0.017610774821345116
The cost after iteration 13000: 0.015068752173866001
The cost after iteration 14000: 0.013179961770325327
```

**Prediction:** For prediction, the last updated weights are saved and passed as parameters to the model. Using forward propagation with the latest weights and biases, the output is predicted for the test data.

```python
predict = forward_propagation(layer_dims, test_x, parameters)[-1][0]
test_cost = cost_calculate(predict, test_y)
test_cost
```

```
array([[0.0122371]])
```
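To turn the sigmoid outputs into hard class labels, a 0.5 threshold is the usual choice (a sketch with made-up probabilities, since the real `predict` and `test_y` come from the dataset):

```python
import numpy as np

predict = np.array([[0.02], [0.97], [0.61], [0.35]])  # dummy sigmoid outputs
test_y = np.array([[0], [1], [1], [0]])

labels = (predict > 0.5).astype(int)   # threshold at 0.5
accuracy = np.mean(labels == test_y)
print(labels.ravel())  # -> [0 1 1 0]
print(accuracy)        # -> 1.0
```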

The same logic can be implemented using Keras from TensorFlow. For deep learning we can use either PyTorch or TensorFlow (Keras); both are capable of handling tensors (matrices extended to higher dimensions). *Note: when using Google Colab for deep learning, we can turn on the GPU under Runtime -> Change runtime type.*

```python
from keras.models import Sequential
from keras.layers import Activation, Dense
from keras import optimizers

model = Sequential()
model.add(Dense(50, input_shape=(4,)))
model.add(Activation('relu'))
model.add(Dense(50))
model.add(Activation('relu'))
model.add(Dense(1))
model.add(Activation('sigmoid'))

sgd = optimizers.SGD(lr=0.01)
# pass the SGD optimizer explicitly; otherwise Keras falls back to its default
model.compile(optimizer=sgd, loss='binary_crossentropy', metrics=['accuracy'])
model.fit(train_x, train_y, validation_data=(test_x, test_y), epochs=30)
```

The first statement, Sequential, defines a model in which the layers are stacked consecutively; combined with Dense layers, this gives a fully connected network.
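As a quick check on what each Dense layer contributes, the first Dense(50) on 4 inputs owns 4 × 50 weights plus 50 biases = 250 parameters; `model.summary()` would list these counts per layer. The same arithmetic for all three Dense layers (a sketch, independent of Keras):

```python
# parameter count of a Dense layer: inputs * units weights + one bias per unit
def dense_params(n_inputs, n_units):
    return n_inputs * n_units + n_units

total = (dense_params(4, 50)      # first hidden layer: 250
         + dense_params(50, 50)   # second hidden layer: 2550
         + dense_params(50, 1))   # output layer: 51
print(total)  # -> 2851
```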

The entire code can be found in the GitHub repository.

*Recommended Video:*