Week 2: Optimization algorithms
2017-10-22
X^{{t}}, Y^{{t}}: the t-th mini-batch.
For example, if m = 5,000,000 and the mini-batch size is 1,000, then there are 5,000 mini-batches.
(Source: Coursera Deep Learning course)
One pass through this for loop over all mini-batches is called 1 epoch of training (a single pass through the entire training set).
Batch gradient descent: 1 epoch allows us to take only 1 gradient descent step.
Mini-batch gradient descent: 1 epoch allows us to take (say) 5000 gradient descent steps.
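As a sketch, the mini-batch split could look like the following (the function name, array shapes, and shuffling are my own illustrative choices, not from the notes):

```python
import numpy as np

def make_mini_batches(X, Y, batch_size, seed=0):
    """Shuffle the m examples (columns) and slice them into mini-batches."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]                        # number of training examples
    perm = rng.permutation(m)             # shuffle so batches are not ordered
    X, Y = X[:, perm], Y[:, perm]
    return [(X[:, k:k + batch_size], Y[:, k:k + batch_size])
            for k in range(0, m, batch_size)]

# Scaled-down version of the example above: 5000 examples, batch size 1000
X = np.random.randn(2, 5000)
Y = np.random.randn(1, 5000)
batches = make_mini_batches(X, Y, batch_size=1000)
# one pass over `batches` (5 of them here) is one epoch
```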
Understanding minibatch gradient descent
(Source: Coursera Deep Learning course)
Mini-batch size = m: Batch gradient descent (BGD).
Mini-batch size = 1: Stochastic gradient descent (SGD).
In practice, choose a mini-batch size strictly between 1 and m:
 If you use BGD: costly computation per iteration.
 If you use SGD: you lose the speedup from vectorization.
If the training set is small: just use BGD.
If it is bigger: try mini-batch sizes of 64, 128, 256, 512…
Make sure minibatch size fits in CPU/GPU memory.
Exponentially weighted averages
(Source: Coursera Deep Learning course)
Blue scatter: θ_{t} = temperature of day t (1 ≤ t ≤ 365).
Red line (moving average / exponentially weighted average of the daily temperature): v_{0} = 0, v_{t} = βv_{t-1} + (1 - β)θ_{t}.
Intuition: v_{t} is between v_{t-1} and θ_{t}. The higher β is, the closer v_{t} is to v_{t-1}.
Another intuition: v_{t} is approximately an average over the last $\frac{1}{1 - \beta}$ days' temperature.
Understanding exponentially weighted averages
Implementing exponentially weighted averages
V_θ = 0
V_θ := βV_θ + (1 - β)θ_{1}
V_θ := βV_θ + (1 - β)θ_{2}
…
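The recurrence above, as a minimal Python sketch (the function name and the constant "temperature" data are illustrative):

```python
def ewa(thetas, beta=0.9):
    """Exponentially weighted average: v_t = beta * v_{t-1} + (1 - beta) * theta_t."""
    v = 0.0                  # v_0 = 0
    averages = []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
        averages.append(v)
    return averages

vals = ewa([10.0, 10.0, 10.0])    # constant "temperature" of 10
# vals[0] = 1.0, vals[1] = 1.9: the average starts far below 10,
# which is the bias problem the next section addresses
```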
Bias correction in exponentially weighted averages
Problem
(Source: Coursera Deep Learning course)
This is because of the initialization v_{0} = 0.
How to fix?
v_{t}^{corrected} = v_{t} / (1 - β^{t})
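A sketch of this fix, dividing each v_t by (1 - β^t) (the function name is my own):

```python
def ewa_corrected(thetas, beta=0.9):
    """Exponentially weighted average with bias correction."""
    v = 0.0
    averages = []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta
        averages.append(v / (1 - beta ** t))   # bias-corrected estimate
    return averages

# For a constant sequence the corrected average is exact from the first step:
vals = ewa_corrected([10.0, 10.0, 10.0])
```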
Gradient descent with momentum
This almost always works faster than the standard gradient descent.
This is what happens when you use mini-batch gradient descent:
(Source: Coursera Deep Learning course)
Gradient descent with momentum:
On iteration t:
Compute dW, db on the current mini-batch.
VdW = β·VdW + (1 - β)·dW
Vdb = β·Vdb + (1 - β)·db
W = W - α·VdW
b = b - α·Vdb
This smooths out the steps of gradient descent:
Imagine you take a ball and roll it down a bowl shape area:
 The derivative terms dW, db: provide acceleration to the ball.
 The momentum terms VdW, Vdb: represent the velocity.
 β: plays the role of friction (prevents the ball from speeding up without limit).
Hyperparameters: α, β. The most common value of β is 0.9 (roughly averaging over the last 10 gradient iterations).
People don't usually use bias correction in gradient descent with momentum, because after only about 10 iterations the moving average has already warmed up and is no longer a biased estimate.
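The momentum update above, sketched as a single-step function (the parameter names and toy gradient are my own; no bias correction, as noted):

```python
import numpy as np

def momentum_step(W, b, dW, db, VdW, Vdb, alpha=0.01, beta=0.9):
    """One step of gradient descent with momentum."""
    VdW = beta * VdW + (1 - beta) * dW   # velocity accumulates the gradients
    Vdb = beta * Vdb + (1 - beta) * db
    W = W - alpha * VdW                  # move along the smoothed direction
    b = b - alpha * Vdb
    return W, b, VdW, Vdb

W, b = np.array([1.0]), 0.0
VdW, Vdb = np.zeros_like(W), 0.0
W, b, VdW, Vdb = momentum_step(W, b, dW=np.array([1.0]), db=1.0, VdW=VdW, Vdb=Vdb)
# the first step uses only (1 - beta) of the gradient: VdW == [0.1]
```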
RMSprop
Root Mean Square prop
On iteration t:
Compute dW, db on the current mini-batch.
SdW = β2·SdW + (1 - β2)·dW**2 (element-wise square)
Sdb = β2·Sdb + (1 - β2)·db**2
W = W - α·dW / (sqrt(SdW) + ε)
b = b - α·db / (sqrt(Sdb) + ε)
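A per-step sketch of the RMSprop update for W (the names and scalar example are illustrative):

```python
import numpy as np

def rmsprop_step(W, dW, SdW, alpha=0.001, beta2=0.999, eps=1e-8):
    """One RMSprop step: damp updates in directions where the gradient is large."""
    SdW = beta2 * SdW + (1 - beta2) * dW ** 2     # running average of squared gradients
    W = W - alpha * dW / (np.sqrt(SdW) + eps)     # element-wise rescaled step
    return W, SdW

W, SdW = np.array([1.0]), np.zeros(1)
W, SdW = rmsprop_step(W, dW=np.array([2.0]), SdW=SdW)
```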
By putting momentum and RMSprop together, you can get an even better optimization algorithm.
Adam optimization algorithm
Adam = Adaptive moment estimation
Hyperparameters choice:
 α: needs to be tuned
 β1 = 0.9
 β2 = 0.999
 ε = 10^{-8}
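Putting momentum and RMSprop together, a one-step Adam sketch with the default hyperparameters above (the variable names are my own; both moments are bias-corrected):

```python
import numpy as np

def adam_step(W, dW, VdW, SdW, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum (first moment) + RMSprop (second moment) + bias correction."""
    VdW = beta1 * VdW + (1 - beta1) * dW
    SdW = beta2 * SdW + (1 - beta2) * dW ** 2
    VdW_hat = VdW / (1 - beta1 ** t)              # bias-corrected first moment
    SdW_hat = SdW / (1 - beta2 ** t)              # bias-corrected second moment
    W = W - alpha * VdW_hat / (np.sqrt(SdW_hat) + eps)
    return W, VdW, SdW

W, VdW, SdW = np.array([1.0]), np.zeros(1), np.zeros(1)
W, VdW, SdW = adam_step(W, dW=np.array([2.0]), VdW=VdW, SdW=SdW, t=1)
# at t = 1 the corrected step size is close to alpha, regardless of the gradient's scale
```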
Learning rate decay
Learning rate decay = slowly reduce learning rate over time.
α = α_{0} / (1 + decay_rate × epoch_num)
decay_rate and α_{0} become hyperparameters you need to tune.
Other learning rate decay methods

α = 0.95^{epoch_num} × α_{0}: exponential decay.
α = (k / sqrt(epoch_num)) × α_{0} or α = (k / sqrt(t)) × α_{0} (with t the mini-batch number).
Discrete staircase: decrease the learning rate (say by half) after some number of iterations.
Manual decay: manually decrease the learning rate.
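The schedules above can be sketched as small functions of the epoch number (the default constants are illustrative, not prescribed by the notes):

```python
def lr_inverse_decay(alpha0, epoch_num, decay_rate=1.0):
    """alpha = alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1 + decay_rate * epoch_num)

def lr_exponential_decay(alpha0, epoch_num, base=0.95):
    """alpha = base**epoch_num * alpha0: exponential decay."""
    return base ** epoch_num * alpha0

def lr_staircase(alpha0, epoch_num, drop_every=10):
    """Discrete staircase: halve the learning rate every `drop_every` epochs."""
    return alpha0 * 0.5 ** (epoch_num // drop_every)
```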
The problem of local optima
Most points of zero gradient (zero derivatives) are not local optima like the points in the left graph, but saddle points like those in the right graph.
In a very high-dimensional space (say 20,000 dimensions), for a point to be a local optimum the cost must curve the same way (all increasing, or all decreasing) in every one of the 20,000 directions, and the chance of that happening is tiny (maybe 2^{-20000}).
So local optima are not the problem, but:
Problem of plateaus
A plateau is a region where the derivative is close to zero for a long time.
Plateaus can really slow down learning.
Summary:
 You're unlikely to get stuck in a bad local optimum (so long as you're training a reasonably large neural network with lots of parameters, and the cost function J is defined over a relatively high-dimensional space).
 Plateaus can make learning slow, and some of the optimization algorithms may help you.