Week 5 notes: Neural Networks: Learning
2017-07-18Content:
Introduction
I think week 5 is the hardest week so far, I had to do tons of research on Backpropagation in order to fully understand it. It was really great when I eventually answered all the questions that came to mind.
This note is my attempt to explain the formulas appear in the week 5’s videos, hope it helps you to understand it too.
Cost Function and Backpropagation
Cost Function
Backpropagation Algorithm
As usual, to minimize using Gradient Descent, we need to compute itself and:
As it’s hard to solve this formula directly; we can modify it using a trick, which is applying the Chain rule:
The second factor can be modified as:
with is the number of units in -th layer (not counting bias unit), we iterate the sum from 0 including the bias unit.
And then we define a new value - “error” of node in layer :
So the formula can be re-written as:
So now we need to compute for all ; and there is a recursive formula that helps us do this:
with is the sigmoid function, hence
And the vectorization form of it:
with .* is the element-by-element (element wise) multiplication operation.
As we can observe: in order to calculate in the current layer (), we need to know of its next layer (, and hence the name Backpropagation); and the starting point of this recursive formula is the of the output layer (). Now let’s compute for output layer:
At output layer, we have :
As we’re using Stochastic Gradient Descent (SGD) method (this algorithm will update the matrix for each training example), now will be the cost function with respect to a single training example (differentiate this from the overall cost function):
Again, at the output layer, we have:
Substitute this into , and then take derivative:
As a result:
Now we will prove :
Because we have:
so:
Substitute this into (4):
Backpropagation algorithm:
- Training set
- Set (for all i, j, l).
- For i = 1 to m
- Set
- Perform forward propagation to compute for l = 2, 3.., L
- Using , compute
- Compute
- Update:
- If then
- If then
And we have:
Backpropagation intuition
There is no value for bias units (actually we can still compute it but no need to use it).
Backpropagation is totally opposite to Feedforward propagation, with each training example :
- Feedforward propagation:
- input:
- calculate using
- Backpropagation:
- input:
- calculate using
and the weight matrices ( matrices) are used similarly in two algorithms.
Backpropagation in Practice
Implementation Note: Unrolling Parameters
Previously to optimize the cost function of Linear Regression and Logistic Regression, in Octave we do:
with and are vectors.
Now in Neural Networks, they are not vectors anymore (but matrices)=> we “unroll” them into vectors.
For example, if we have:
and we want to put them in a single vector, in Octave we can do this:
: put all theta1’s columns into 1 single column vector. And to get back:
Gradient checking
Two sides difference estimate:
this will give us a numerical estimate of the gradient at that point.
In case is a vector:
In implementation (again, is a vector unrolled from weight matrices):
and then check that , with is the derivative we got from Backpropagation.
Implementation Note:
- Implement backprop to comput DVec.
- Implement numerial gradient check to compute gradApprox.
- Make sure they give similar values.
- Turn off gradient checking. Use backprop code (DVec) for learning.
Important:
- Be sure to disable your gradient checking code before training your classifier. If you run numerical gradient computation on every iteration of gradient descent (or in the inner loop of costFunction(…)), your code will be very slow (because the gradient checking code is very computationally expensive).
The main reason why we use the backprop algorithm rather than the numerical gradient computation method during learning is that the numerical gradient algorithm is very slow (and the backprop one is much more efficient).
Random Initialization
In Neural Networks, initializing all values as 0 (or a same value) doesn’t work. Because after each update, parameters corresponding to inputs going into each hidden units (headed units) are identical => highly redundant representation => prevents the network from doing something interesting.
Solution: Random Initialization (symmetry breaking).
Initialize each to a random value in :
The here has nothing to do with the used in gradient checking.
One effective strategy for choosing is to base it on the number of units in the network. A good choice of is:
where and are the number of units in the layers adjacent to .
Putting it together
The first thing we need to do when training a networks: pick some network architecture. => how many hidden layers, how many units in each layer…
Number of input units: Dimension of feature .
Number of output units: Number of classes.
Reasonable default:
- 1 hidden layer (most common).
- if >1 hidden layer, have same number of hidden units in every layer (usually the more the better). And also the number of hidden units is usually comparable to the dimension of x (number of features): 3 or 4 times of that (or greater).
Steps to train a neural networks:
- Randomly initilize weights (usually small value near 0).
- Implement forward propagation to get for any
- Implement code to compute cost function .
- Implement backprop to compute partial derivatives .
(In implementation, we use a for loop:
)
- Use gradient checking to compare computed using backpropagation vs. using numerical estimate of gradient of . Then disable gradient checking code.
- Use gradient descent or advanced optimization method with backpropagation to try to minimize as a function of parameters .
In Neural Networks, is a non-convex function; so gradient descent algorithm can get stuck in local minima. In practice, this is not a huge problem; because it can find a good local optima if it doesn’t even get to the global one.
Application of Neural Networks
No notes.
Application of Neural Networks
References
[2] How to derive errors in neural network with the backpropagation algorithm?