- Recap
- Gradient descent can be sped up by incremental updates.
- Convergence can be improved using smoothed updates (see the sketch after this list).
- The choice of divergence affects both the learned network and results.
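To illustrate the recap, here is a minimal sketch of an incremental (SGD) update with a smoothed, momentum-style direction. The learning rate `eta` and smoothing factor `beta` are illustrative values, not values from the lecture.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, eta=0.01, beta=0.9):
    """One incremental update with a smoothed (momentum) direction.

    The running velocity is an exponentially smoothed version of past
    gradients, which damps oscillations compared to plain SGD.
    """
    velocity = beta * velocity - eta * grad     # smooth the update direction
    w = w + velocity                            # incremental parameter update
    return w, velocity

# toy usage: minimize f(w) = ||w||^2, whose gradient is 2w
w = np.array([5.0, -3.0])
v = np.zeros_like(w)
for _ in range(100):
    w, v = sgd_momentum_step(w, 2 * w, v)
print(w)  # approaches the minimum at [0, 0]
```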
- Batch Normalization (Mini-batches)
- Problem: covariate shift. The data inside a mini-batch is not a perfectly uniform sample of the training set; the samples are correlated, so the distribution of each layer's inputs shifts from batch to batch.
- Solution: move all subgroups to a standard location by normalizing each mini-batch to zero mean and unit variance, then let the network learn the right shift and scale.
- Formula:
- Covariate shift to standard position: $\mu_B = \frac{1}{B}\sum_i z_i$, $\sigma_B^2 = \frac{1}{B}\sum_i (z_i - \mu_B)^2$, $\hat{z}_i = \frac{z_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$ (add $\epsilon$ to avoid divide by zero).
- Shift to the right position: $u_i = \gamma\,\hat{z}_i + \beta$, where $\gamma$ and $\beta$ are learned parameters.
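A minimal NumPy sketch of the forward pass above, assuming the pre-activations of a layer are arranged as a `(batch, features)` array; `gamma`, `beta`, and `eps` are the learned scale/shift and the divide-by-zero guard.

```python
import numpy as np

def batchnorm_forward(z, gamma, beta, eps=1e-5):
    """Normalize a mini-batch to zero mean / unit variance, then shift and scale.

    z: (B, D) mini-batch of pre-activations; gamma, beta: (D,) learned params.
    """
    mu = z.mean(axis=0)                      # mini-batch mean, mu_B
    var = z.var(axis=0)                      # mini-batch variance, sigma_B^2
    z_hat = (z - mu) / np.sqrt(var + eps)    # covariate shift to standard position
    u = gamma * z_hat + beta                 # shift to the "right" position
    cache = (z_hat, mu, var, gamma, eps, z)  # saved for the backward pass
    return u, cache
```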
- Note on derivatives:
- The divergence for each $z_i$ depends on all the $z_j$ within the minibatch, because $\mu_B$ and $\sigma_B^2$ are computed from the whole batch.
- Back-propagation:
- $\frac{\partial Div}{\partial \hat{z}_i} = \gamma \frac{\partial Div}{\partial u_i}$; $\frac{\partial Div}{\partial \gamma} = \sum_i \frac{\partial Div}{\partial u_i}\hat{z}_i$; $\frac{\partial Div}{\partial \beta} = \sum_i \frac{\partial Div}{\partial u_i}$.
- $\frac{\partial Div}{\partial \sigma_B^2} = -\frac{1}{2}(\sigma_B^2+\epsilon)^{-3/2}\sum_i \frac{\partial Div}{\partial \hat{z}_i}(z_i-\mu_B)$.
- $\frac{\partial Div}{\partial \mu_B} = -\frac{1}{\sqrt{\sigma_B^2+\epsilon}}\sum_i \frac{\partial Div}{\partial \hat{z}_i} - \frac{2}{B}\frac{\partial Div}{\partial \sigma_B^2}\sum_i (z_i-\mu_B)$.
- $\frac{\partial Div}{\partial z_i} = \frac{1}{\sqrt{\sigma_B^2+\epsilon}}\frac{\partial Div}{\partial \hat{z}_i} + \frac{2(z_i-\mu_B)}{B}\frac{\partial Div}{\partial \sigma_B^2} + \frac{1}{B}\frac{\partial Div}{\partial \mu_B}$.
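The same derivatives in NumPy, continuing the hypothetical `batchnorm_forward` sketch above (the `cache` layout is my own convention, not part of the lecture).

```python
import numpy as np

def batchnorm_backward(d_u, cache):
    """Backward pass matching the derivatives listed above.

    d_u: (B, D) gradient of the divergence w.r.t. the BN outputs u_i.
    """
    z_hat, mu, var, gamma, eps, z = cache
    B = z.shape[0]

    d_gamma = np.sum(d_u * z_hat, axis=0)            # dDiv/dgamma
    d_beta = np.sum(d_u, axis=0)                     # dDiv/dbeta

    d_zhat = d_u * gamma                             # dDiv/dz_hat
    d_var = -0.5 * (var + eps) ** -1.5 * np.sum(d_zhat * (z - mu), axis=0)
    d_mu = (-np.sum(d_zhat, axis=0) / np.sqrt(var + eps)
            - 2.0 / B * d_var * np.sum(z - mu, axis=0))
    d_z = (d_zhat / np.sqrt(var + eps)
           + 2.0 * (z - mu) / B * d_var
           + d_mu / B)
    return d_z, d_gamma, d_beta
```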
- Inference:
- Compute the average over all training minibatches: $\mu = E[\mu_B]$ and $\sigma^2 = \frac{B}{B-1}E[\sigma_B^2]$ (the unbiased variance estimate).
- Apply to inference instances: $u = \gamma \frac{z - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$.
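A sketch of the inference-time rule, assuming the per-minibatch means and variances were stored during training (in practice a running average is usually kept instead); `B` is the minibatch size.

```python
import numpy as np

def batchnorm_collect_stats(batch_means, batch_vars, B):
    """Average the saved per-minibatch statistics for use at inference time."""
    mu = np.mean(batch_means, axis=0)                 # E[mu_B]
    var = B / (B - 1) * np.mean(batch_vars, axis=0)   # unbiased E[sigma_B^2]
    return mu, var

def batchnorm_inference(z, gamma, beta, mu, var, eps=1e-5):
    """Apply the frozen statistics to a single inference instance."""
    return gamma * (z - mu) / np.sqrt(var + eps) + beta
```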
- Over-fitting
- Unconstrained networks can learn large weights, which produce sharp changes in the output.
- Constraining the weights to be low forces slower perceptrons and a smoother output response, e.g. by minimizing $L(W) = \frac{1}{T}\sum_t Div(y_t, d_t) + \frac{\lambda}{2}\lVert W\rVert^2$ (L2 regularization); see the sketch after this list.
- MLPs naturally impose constraints: each layer works on the output of the previous layer, which is already smoothed, so a deeper network provides a smoother output.
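A minimal sketch of the weight constraint as an L2 penalty added to the data divergence; `lam` is an assumed regularization strength and `weights` is a list of NumPy weight matrices.

```python
import numpy as np

def l2_regularized_loss(data_loss, weights, lam=1e-4):
    """Total objective: data divergence plus a penalty that keeps weights small."""
    penalty = 0.5 * lam * sum(np.sum(W ** 2) for W in weights)
    return data_loss + penalty

def l2_gradient(grad_W, W, lam=1e-4):
    """Gradient of the penalized objective w.r.t. one weight matrix."""
    return grad_W + lam * W   # the extra lam*W term shrinks the weights each step
```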
- Bagging
- Sample the training data (with replacement) and train several different classifiers; combine their predictions by voting or averaging. See the sketch below.
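A minimal bagging sketch in NumPy: bootstrap-resample the training set and train one classifier per resample. The `train_fn` and `predict_fn` callables are placeholders for whatever model is used, and labels are assumed to be non-negative integers so a majority vote can be taken with `bincount`.

```python
import numpy as np

def bag_classifiers(X, y, train_fn, n_models=10, rng=None):
    """Train several classifiers on bootstrap resamples of (X, y)."""
    rng = rng or np.random.default_rng(0)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))   # sample with replacement
        models.append(train_fn(X[idx], y[idx]))
    return models

def bagged_predict(models, predict_fn, X):
    """Majority vote over the individual classifiers' integer label predictions."""
    votes = np.stack([predict_fn(m, X) for m in models])   # (n_models, N)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```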
- Dropout
- Training: for each input, at each iteration, "turn off" each neuron with a probability of $1-\alpha$.
- Each input is going to see a different network at each pass. For N neurons, there are $2^N$ possible sub-networks.
- Without dropout, a non-compressive layer may simply clone its input to its output.
- Formula:
- Each neuron has the activation $y_i = D_i \, f\!\left(\sum_j w_{ij} y_j + b_i\right)$, where the $y_j$ are the previous layer's outputs and $D_i$ is a Bernoulli variable that takes value 1 with probability $\alpha$.
- The expected output of the neuron is $E[y_i] = \alpha \, f\!\left(\sum_j w_{ij} y_j + b_i\right)$.
- Testing: simply scale the output of the neurons by $\alpha$. Equivalently (see the sketch after this list):
- Testing time: scale the weights with $\alpha$, i.e. $W_{test} = \alpha W_{trained}$; OR
- Training time: scale the weights with $1/\alpha$ ("inverted dropout") and leave them unchanged at test time.
- Variants:
- Zoneout; Dropconnect; Shakeout; Whiteout.
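An inverted-dropout sketch of the training/testing rules above, assuming activations are stored as float NumPy arrays; `alpha` is the retention probability (the Bernoulli parameter of $D$) and its default value is only illustrative.

```python
import numpy as np

def dropout_train(y, alpha=0.8, rng=None):
    """Training: turn each neuron off with probability 1 - alpha.

    Scaling by 1/alpha here ("inverted dropout") keeps the expected output
    equal to the no-dropout output, so test time needs no extra scaling.
    """
    rng = rng or np.random.default_rng()
    D = (rng.random(y.shape) < alpha).astype(y.dtype)   # Bernoulli(alpha) mask
    return y * D / alpha

def dropout_test(y):
    """Testing: with inverted dropout the activations are used as-is."""
    return y
```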
- Other heuristics
- Early stopping: stop training when the validation error starts to rise, to prevent over-fitting.
- Gradient clipping: set a ceiling on the derivative values; a typical ceiling is 5 (see the sketch after this list).
- Data augmentation: train on distorted copies of the examples (rotation, stretching, etc.).
- Normalize the input.
- Initialization techniques.
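A sketch of gradient clipping with the ceiling of 5 mentioned above: element-wise clipping as described, plus a norm-based variant that is a common alternative (the norm variant is my addition, not from the notes).

```python
import numpy as np

def clip_gradients(grads, ceiling=5.0):
    """Cap every derivative value at +/- ceiling before the update step."""
    return [np.clip(g, -ceiling, ceiling) for g in grads]

def clip_by_norm(grads, max_norm=5.0):
    """Alternative: rescale the whole gradient if its total norm exceeds max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads]
```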
- Setting up a problem:
- Obtain training data
- Choose network architecture
- Choose divergence function
- Choose heuristics
- Choose optimization algorithm
- Grid search on hyperparameters (see the sketch below).
- Train.
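A minimal grid-search loop over hyperparameters; the parameter names and the `train_and_validate` callable are placeholders for whatever the chosen architecture and optimizer expose, and the grid values are purely illustrative.

```python
from itertools import product

def grid_search(train_and_validate, grid):
    """Try every hyperparameter combination and keep the best validation score."""
    best_score, best_params = float("-inf"), None
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = train_and_validate(**params)   # e.g. validation accuracy
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# example grid (illustrative values only)
grid = {"learning_rate": [1e-3, 1e-2], "dropout_alpha": [0.5, 0.8]}
```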