Coursera Deep Learning Course 2 Week 1 notes: Practical aspects of Deep Learning
2017-10-20
Setting up your Machine Learning Application
Train/Dev/Test Sets
Applied ML is a highly iterative process:
- You start with a simple idea.
- You code it up, try it, and get a result.
- Based on the outcome, you may refine the idea… and try to find a better one.
How quickly can you go around this cycle?
It is almost impossible to pick the correct hyperparameter values on the first try.
Data is divided into:
- Training set.
- Hold-out cross validation set, or development set, or dev set.
- Test set.
The workflow: you keep training your algorithm on the training set and use the dev set to evaluate its performance; then you take your best model and run it on the test set to get an unbiased estimate of how your algorithm is doing.
Common proportions:
- 70/30 (train/test): if you don’t have a dev set.
- 60/20/20 (train/dev/test).
The goal of the dev set is to help you quickly see which model is better, so it only needs to be large enough for that, not necessarily 20% of the data. The same goes for the test set.
So if the data is big:
- 98/1/1
- 99.5/0.4/0.1
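As a concrete sketch, here is one way to do such a split with numpy, assuming examples are stored as columns of X (the course’s convention) and the data is already shuffled; the function name and the 98/1/1 defaults are just for illustration:

```python
import numpy as np

def split_data(X, Y, train_frac=0.98, dev_frac=0.01):
    """Split examples (columns of X) into train/dev/test sets."""
    m = X.shape[1]                      # number of examples
    n_train = int(m * train_frac)
    n_dev = int(m * dev_frac)
    X_train, Y_train = X[:, :n_train], Y[:, :n_train]
    X_dev, Y_dev = X[:, n_train:n_train + n_dev], Y[:, n_train:n_train + n_dev]
    X_test, Y_test = X[:, n_train + n_dev:], Y[:, n_train + n_dev:]
    return (X_train, Y_train), (X_dev, Y_dev), (X_test, Y_test)
```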
Mismatched train/test distributions
Training set: cat pictures from the internet (may be high-quality images).
Dev/test sets: cat pictures from users of your app (may be low-quality images).
Rule of thumb: make sure the dev and test sets come from the same distribution.
Not having a test set might be okay (train/dev only).
Bias/Variance
A classifier with high bias is underfitting the data.
A classifier with high variance is overfitting the data.
Assume that optimal error = 0%:
| Train error | 1% | 15% | 15% | 0.5% |
|---|---|---|---|---|
| Dev error | 11% | 16% | 30% | 1% |
| Diagnosis | high variance | high bias | high bias and high variance | low bias, low variance |
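As a toy illustration of this diagnosis logic (the 1% tolerance is my own illustrative choice, not from the course):

```python
def diagnose(train_err, dev_err, optimal_err=0.0, tol=0.01):
    """Rough bias/variance diagnosis from train/dev error rates (fractions)."""
    bias = train_err - optimal_err      # avoidable bias
    variance = dev_err - train_err      # gap between train and dev
    labels = []
    if bias > tol:
        labels.append("high bias")
    if variance > tol:
        labels.append("high variance")
    return " and ".join(labels) or "low bias, low variance"

print(diagnose(0.01, 0.11))   # high variance
print(diagnose(0.15, 0.16))   # high bias
print(diagnose(0.15, 0.30))   # high bias and high variance
print(diagnose(0.005, 0.01))  # low bias, low variance
```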
Basic Recipe for Machine Learning
High bias? (look at training set performance):
- Bigger network.
- Train longer.
- Try a different neural network architecture.
High variance? (look at dev set performance):
- More data.
- Regularization.
- Try a different neural network architecture.
In the pre-deep-learning era, we didn’t have tools that reduce bias without hurting variance (or vice versa), hence the “bias-variance tradeoff”. In the deep learning era, a bigger network and more data can often reduce one without hurting the other much.
Regularizing your neural network
Regularization
L2 regularization (for logistic regression):

$$J(w, b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\|w\|_2^2, \qquad \|w\|_2^2 = \sum_{j=1}^{n_x} w_j^2$$

L1 regularization:

$$J(w, b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\|w\|_1, \qquad \|w\|_1 = \sum_{j=1}^{n_x} |w_j|$$

Regularization term in a neural network:

$$\frac{\lambda}{2m}\sum_{l=1}^{L}\|W^{[l]}\|_F^2, \qquad \|W^{[l]}\|_F^2 = \sum_{i=1}^{n^{[l]}}\sum_{j=1}^{n^{[l-1]}}\big(W_{ij}^{[l]}\big)^2$$

where $\|\cdot\|_F$ is called the Frobenius norm (the L2 norm for matrices).
The alternative name for L2 regularization is weight decay.
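The name comes from the update rule: the Frobenius term adds $\frac{\lambda}{m}W^{[l]}$ to the gradient, so each gradient descent step shrinks $W^{[l]}$ by a constant factor:

$$dW^{[l]} = (\text{backprop term}) + \frac{\lambda}{m}W^{[l]}$$

$$W^{[l]} := W^{[l]} - \alpha\, dW^{[l]} = \Big(1 - \frac{\alpha\lambda}{m}\Big)W^{[l]} - \alpha\,(\text{backprop term})$$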
Why regularization reduces overfitting?
For large $\lambda$, regularization makes $W^{[l]} \approx 0$, thus reducing the impact of a lot of hidden units, and the neural network effectively becomes simpler.
Another intuition, in case of using $\tanh$ as the activation function: if $\lambda$ is large, $W^{[l]}$ is penalized heavily and becomes small; with $W^{[l]}$ small, $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$ is also small, and since $\tanh$ is roughly linear near 0, $a^{[l]} = \tanh(z^{[l]})$ will be roughly linear.
Every layer will be roughly linear, and as a result the neural network is just a linear network.
Thus regularization reduces overfitting.
Dropout Regularization
Implementing dropout (“Inverted dropout”)
We define a variable `keep_prob` (the probability that a given hidden unit is kept). At layer (say) 3, create a random mask `d3` with the same shape as `a3`:

```python
import numpy as np

# keep each unit with probability keep_prob (boolean mask)
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob
a3 = np.multiply(a3, d3)  # zero out the dropped units
a3 /= keep_prob           # inverted dropout: keep E[a3] unchanged
```
For example, with `keep_prob = 0.8`, `a3` will be reduced by `1 - keep_prob = 0.2` (20% of the elements of `a3` will be zeroed out). In order not to reduce the expected value of $z^{[4]} = W^{[4]} a^{[3]} + b^{[4]}$, we need to divide `a3` by `keep_prob`.
The line `a3 /= keep_prob` is the “inverted dropout” technique: no matter what value you set `keep_prob` to, the expected value of `a3` is ensured to remain the same.
You should randomly zero out different hidden units each time: e.g. in iteration 1 of gradient descent you may zero out some hidden units; in the second iteration you may zero out a different set of hidden units, and so on…
Making predictions at test time
Do not use dropout at test time; just perform a forward pass to compute the output.
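A minimal sketch of one layer’s forward pass covering both modes; the ReLU layer and the `train` flag are my own framing for illustration:

```python
import numpy as np

def forward_layer(a_prev, w, b, keep_prob=1.0, train=True):
    """One layer's forward pass with inverted dropout on its output."""
    z = np.dot(w, a_prev) + b
    a = np.maximum(0, z)                      # ReLU activation
    if train and keep_prob < 1.0:
        d = np.random.rand(*a.shape) < keep_prob
        a = a * d / keep_prob                 # inverted dropout: keep E[a] unchanged
    return a                                  # at test time (train=False): no dropout
```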
Understanding Dropout
1st intuition: on each iteration you train a smaller, simpler neural network, which has a regularizing effect.
2nd intuition: a unit can’t rely on any one input feature (because any of them can be randomly zeroed out), so it has to spread out its weights. Spreading out the weights tends to shrink the squared norm of the weights (an effect similar to L2 regularization).
Each layer can have a different `keep_prob` value:
- For layers you don’t worry about overfitting, you can set `keep_prob` high (say 1.0).
- Otherwise, decrease `keep_prob` at the layers you’re most worried about overfitting.
Dropout can technically be applied to the input layer, but in practice it is not used there very often.
Downside: the cost function J is no longer well-defined, so we can’t plot J as a function of the number of iterations to debug. (Workaround from the lecture: first run with `keep_prob = 1` and check that J decreases monotonically, then turn dropout back on.)
Other regularization methods
Increase the data size by data augmentation: rotating images, flipping images horizontally…
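A tiny numpy sketch of horizontal flipping, assuming an image stored as a height × width × channels array:

```python
def augment(image):
    """Return the original image plus a horizontally flipped copy."""
    flipped = image[:, ::-1, :]   # reverse the width axis
    return [image, flipped]
```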
Early stopping
Plot the training error (or cost J) and the dev set error against the number of iterations; stop training around the iteration where the dev set error starts to rise.
Downside: in ML you care about two separate tasks: optimizing the cost function J, and avoiding overfitting. ML is easier to think about when you have one set of tools for optimizing J and a completely separate set of tools for not overfitting (reducing variance).
This principle is sometimes called orthogonalization: the idea that you want to think about one task at a time.
Early stopping couples these two tasks, so you can no longer work on them independently: by stopping gradient descent early, you’re breaking off the optimization of J (not doing a great job reducing J) while simultaneously trying not to overfit.
So instead of using different tools to solve the two problems, you’re using one tool that mixes the two.
Alternative method: use L2 regularization, then you can just train the neural network as long as possible. Downside: you have to try many values of lambda (early stopping avoids this; it just runs gradient descent once).
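For comparison, a minimal early stopping loop as a sketch; `train_step` and `dev_error` are hypothetical placeholders, and the patience logic is one common variant rather than anything from the lecture:

```python
def train_with_early_stopping(params, train_step, dev_error,
                              max_iters=10000, patience=100):
    """Stop when dev error hasn't improved for `patience` iterations."""
    best_err, best_params, since_best = float("inf"), params, 0
    for i in range(max_iters):
        params = train_step(params)      # one gradient descent step (returns fresh params)
        err = dev_error(params)          # evaluate on the dev set
        if err < best_err:
            best_err, best_params, since_best = err, params, 0
        else:
            since_best += 1
            if since_best >= patience:   # dev error stopped improving
                break
    return best_params
```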
Setting up your optimization problem
Normalizing inputs
Normalize the training inputs in two steps: subtract the mean ($\mu = \frac{1}{m}\sum_i x^{(i)}$), then scale each feature to unit variance. Use the same $\mu$ and $\sigma$ to normalize the dev/test sets. With normalized inputs, the cost function is more symmetric (“round”), so gradient descent can take larger steps and converge faster.
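A minimal sketch of this normalization, assuming examples are stored as columns of X:

```python
import numpy as np

def normalize_inputs(X_train, X_test):
    """Normalize features to zero mean / unit variance (examples as columns)."""
    mu = np.mean(X_train, axis=1, keepdims=True)
    sigma = np.std(X_train, axis=1, keepdims=True)
    X_train_norm = (X_train - mu) / sigma
    X_test_norm = (X_test - mu) / sigma    # reuse the training statistics!
    return X_train_norm, X_test_norm
```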
Vanishing/Exploding gradients
When you’re training a very deep network, the derivatives (slopes) can sometimes get very big or very small, which makes training difficult. Intuition: activations and gradients are multiplied by a weight matrix at every layer, so values slightly greater than 1 can grow exponentially with depth, and values slightly less than 1 can shrink exponentially.
Weight Initialization for Deep Networks
A partial solution for this problem: more careful choice of random weight initialization.
```python
# shape = (n_l, n_{l-1}); note np.random.randn takes dimensions, not a tuple
wl = np.random.randn(*shape) * np.sqrt(var)
```
where `var` is:
- $\frac{2}{n^{[l-1]}}$ for the ReLU activation function (He initialization).
- $\frac{1}{n^{[l-1]}}$ for the tanh activation function (Xavier initialization).
- Another version of `var`: $\frac{2}{n^{[l-1]} + n^{[l]}}$.
All of this is to set the variance of $W^{[l]}$ to `var`, so that $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$ neither explodes nor vanishes as the network gets deeper.
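A sketch of initializing a whole network this way, assuming ReLU layers (He initialization) and a hypothetical `layer_dims` list `[n_x, n_1, …, n_L]`:

```python
import numpy as np

def initialize_parameters_he(layer_dims):
    """He initialization: Var(w) = 2 / n[l-1] for each layer l."""
    params = {}
    for l in range(1, len(layer_dims)):
        params["W" + str(l)] = (np.random.randn(layer_dims[l], layer_dims[l - 1])
                                * np.sqrt(2.0 / layer_dims[l - 1]))
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params
```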
Numerical approximation of gradients
Use the two-sided difference to approximate a derivative:

$$f'(\theta) \approx \frac{f(\theta + \varepsilon) - f(\theta - \varepsilon)}{2\varepsilon}$$

Its approximation error is $O(\varepsilon^2)$, versus $O(\varepsilon)$ for the one-sided difference $\frac{f(\theta + \varepsilon) - f(\theta)}{\varepsilon}$, so it is much more accurate for small $\varepsilon$.
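A tiny check of the two formulas on the toy function $f(\theta) = \theta^3$ (my own example):

```python
def f(theta):
    return theta ** 3

eps, theta = 1e-2, 1.0
two_sided = (f(theta + eps) - f(theta - eps)) / (2 * eps)
one_sided = (f(theta + eps) - f(theta)) / eps
print(two_sided)  # ~3.0001 (true derivative is 3*theta**2 = 3)
print(one_sided)  # ~3.0301 (noticeably less accurate)
```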
Gradient Checking
Gradient checking for a neural network
Take $W^{[1]}, b^{[1]}, \ldots, W^{[L]}, b^{[L]}$ and reshape them into one big vector $\theta$.
Take $dW^{[1]}, db^{[1]}, \ldots, dW^{[L]}, db^{[L]}$ and reshape them into one big vector $d\theta$.
For each $i$, compute $d\theta_{\text{approx}}[i] = \frac{J(\theta_1, \ldots, \theta_i + \varepsilon, \ldots) - J(\theta_1, \ldots, \theta_i - \varepsilon, \ldots)}{2\varepsilon}$ (with $\varepsilon = 10^{-7}$), then check the relative difference $\frac{\|d\theta_{\text{approx}} - d\theta\|_2}{\|d\theta_{\text{approx}}\|_2 + \|d\theta\|_2}$: around $10^{-7}$ is great, around $10^{-5}$ deserves a closer look, around $10^{-3}$ probably means a bug.
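A compact sketch of that loop; `cost(theta)` (recomputing J from the flattened parameters) and the flattened backprop gradient `dtheta` are hypothetical helpers:

```python
import numpy as np

def grad_check(cost, theta, dtheta, eps=1e-7):
    """Compare backprop's dtheta with a two-sided numerical estimate."""
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        approx[i] = (cost(plus) - cost(minus)) / (2 * eps)
    diff = (np.linalg.norm(approx - dtheta)
            / (np.linalg.norm(approx) + np.linalg.norm(dtheta)))
    return diff   # ~1e-7 great, ~1e-5 look closer, ~1e-3 probably a bug
```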
Gradient Checking Implementation Notes
- Don’t use grad check in training, only when debugging (it’s computationally expensive).
- If the algorithm fails grad check, look at the individual components of $d\theta$ (which correspond to particular $W^{[l]}$’s and $b^{[l]}$’s) to try to identify the bug.
- Remember regularization: if J includes the regularization term, $d\theta$ must include it too.
- Grad check doesn’t work with dropout (set `keep_prob = 1` while checking).
- Run it at random initialization, and perhaps again after some training (after some number of iterations).