Day 53(DL) — Convex Vs Non-Convex & Saddle points in NN

The objective of any deep learning requirement is to reach the global minima where the loss will be extremely low. Till now we’ve discussed the cost function is convex in nature where we have only global minima. But for deep neural networks that is not the case, the loss function is Non-Convex. It implies, we have many local optimas and only one global minimum. Let’s compare both functions to gain better intuition.

Convex function:

Fig 1 — shows convex function from wiki

Non-Convex function:

Fig 2 — shows non-convex from wiki
  • As we can see, the convex function has a nice property of having only one minimum point which makes it easier to optimize. Conversely, NCF has multiple minimum points and one global minima.

Saddle points: In deep NNs, the cost function always tends to have saddle points where the gradients are zero(minimum value). Here the problem is plateaus, often referred to as flat surfaces where there is no change in the gradients(zero learning). If the model gets stuck in a plateau, it will slow down the learning process and sometimes there will not be any learning at all.

Fig 3 — displays saddle points in the loss function from wiki
  • We have various optimization techniques followed to escape from the flat surfaces. These methods are an enhanced version of vanilla gradient descent. Since vanilla gradient descent suffers from getting stuck in the local minima and has a slow convergence rate.

For this discussion, we will check out the working principle behind Gradient descent with momentum.

Gradient Descent with momentum: This technique greatly helps the model to reach the global minima by taking significant steps and crossing over the local minima with the momentum. It uses the exponentially weighted average to update the weights.

  • If we take the entire batch of training samples to calculate the weights, then during one epoch only once the weights will get updated. On the other hand, if we split the entire training set into mini-batches, then the number of times the weight updates happen will be more and thus the model can rapidly reach the global minima.

curr_val(exponentially weighted avg) = beta * (prev_grad) + (1-beta) * curr_grad

Bias Correction: During the initial mini-batches, there will not be much of the past data and this will create a bias since 0.9(beta) weightage is given to the history and only 0.1%(1-beta) weight assigned to current data. In order to make the learning unbiased, the entire term is divided by bias factor(1-beta^t) and this value produces effective results during the initial stage and will not have any impact in the later phase. Since ‘t’ corresponds to the iteration, initially the denominator becomes high as the iteration increases the denominator will become small(implying previous weights have more influence).

The momentum (beta) must be higher to smooth out the update because we give more weight to the past gradients. It is recommended to use the default value for β = 0.9 but if required, it can be tuned between 0.8 to 0.999.

In momentum, instead of using dW and db independently for each epoch, we take the exponentially weighted averages of dW and db.

VdW = β * VdW + (1 — β) * dW

Vdb = β * Vdb + (1 — β) * db

After calculating exponentially weighted averages, we will update our parameters.

W = W — learning rate *VdW

b = b — learning rate * Vdb


(1) It will not be able to handle saddle points where the gradient is almost the same and is not changing much.

(2) overshoots the target and has to take a lot of U-turns to reach the objective.

Recommended Reading:

AI Enthusiast

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store