  1. Recap
    1. Gradient descent can be sped up by incremental updates.
    2. Convergence can be improved using smoothed updates (e.g., momentum; see the sketch after this list).
    3. The choice of divergence affects both the learned network and results.
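As a reminder of the "smoothed updates" point above, a minimal sketch of one SGD-with-momentum step (lr, beta, and the running velocity are illustrative names; momentum is only one of the smoothing schemes covered earlier):

```python
def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    """One SGD-with-momentum update: the running velocity smooths the raw gradient."""
    velocity = beta * velocity - lr * grad   # exponentially smoothed update direction
    w = w + velocity                         # take a step along the smoothed direction
    return w, velocity
```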
  2. Batch Normalization (Mini-batches)
    1. Problem: covariate shift. The data inside a mini-batch is not uniformly sampled from the training distribution; samples within a batch are correlated, so the input distribution seen by each layer drifts from batch to batch.
    2. Solution: move every mini-batch into a standard location (zero mean, unit variance), then let the network learn an appropriate shift and scale.
    3. Formula:
      1. Normalize to the standard position: u_i = \frac{z_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} (the \epsilon avoids division by zero).
      2. Shift and scale to the learned position: \hat{z}_i = \gamma u_i + \beta. (A forward-pass sketch follows this formula block.)
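A minimal NumPy sketch of the two formulas above, assuming a mini-batch z of shape (B, D) and learnable per-feature parameters gamma and beta (the function and variable names are illustrative):

```python
import numpy as np

def batchnorm_forward(z, gamma, beta, eps=1e-5):
    """z: (B, D) mini-batch of pre-activations; gamma, beta: (D,) learnable scale and shift."""
    mu_B = z.mean(axis=0)                    # batch mean mu_B
    var_B = z.var(axis=0)                    # batch variance sigma_B^2
    u = (z - mu_B) / np.sqrt(var_B + eps)    # normalize to the standard position
    z_hat = gamma * u + beta                 # shift/scale to the learned position
    cache = (z, u, mu_B, var_B, gamma, eps)  # saved for the backward pass
    return z_hat, cache
```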
    4. Note on derivatives:
      1. The divergence for each Y_t depends on all the X_t within the mini-batch, because \mu_B and \sigma_B^2 are computed from every sample in the batch.
    5. Back-propagation (a NumPy sketch follows this list):
      1. \frac{dDiv}{d\hat{z}_i} = f'(\hat{z}_i)\frac{dDiv}{dy_i}.
      2. \frac{dDiv}{d\beta} = \frac{dDiv}{d\hat{z}_i}; \frac{dDiv}{d\gamma} = u_i \frac{dDiv}{d\hat{z}_i}.
      3. \frac{dDiv}{du_i} = \gamma\frac{dDiv}{d\hat{z}_i}.
      4. \frac{\partial{Div}}{\partial{z_i}} = \frac{\partial{Div}}{\partial{u_i}}\frac{\partial{u_i}}{\partial{z_i}} + \frac{\partial{Div}}{\partial{\sigma_B^2}}\frac{\partial{\sigma_B^2}}{\partial{z_i}} + \frac{\partial{Div}}{\partial{\mu_B}}\frac{\partial{\mu_B}}{\partial{z_i}}.
      5. \frac{\partial{Div}}{\partial{z_i}} = \frac{\partial{Div}}{\partial{u_i}}\frac{1}{\sqrt{\sigma_B^2+\epsilon}} + \frac{\partial{Div}}{\partial{\sigma_B^2}} \frac{2(z_i - \mu_B)}{B} + \frac{\partial{Div}}{\partial{\mu_B}} \frac{1}{B}.
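Putting the chain-rule terms above together, a sketch of the full backward pass; it consumes the cache produced by the batchnorm_forward sketch above, and dz_hat is dDiv/d\hat{z} arriving from the layer above:

```python
import numpy as np

def batchnorm_backward(dz_hat, cache):
    """dz_hat: (B, D) gradient dDiv/dz_hat from above; returns dDiv/dz, dDiv/dgamma, dDiv/dbeta."""
    z, u, mu_B, var_B, gamma, eps = cache
    B = z.shape[0]
    dbeta = dz_hat.sum(axis=0)                # dDiv/dbeta, summed over the batch
    dgamma = (u * dz_hat).sum(axis=0)         # dDiv/dgamma = sum_i u_i * dDiv/dz_hat_i
    du = gamma * dz_hat                       # dDiv/du_i
    dvar = (du * (z - mu_B) * -0.5 * (var_B + eps) ** -1.5).sum(axis=0)      # dDiv/dsigma_B^2
    dmu = (-du / np.sqrt(var_B + eps)).sum(axis=0) \
          + dvar * (-2.0 * (z - mu_B)).mean(axis=0)                          # dDiv/dmu_B
    dz = du / np.sqrt(var_B + eps) + dvar * 2.0 * (z - mu_B) / B + dmu / B   # the expanded sum above
    return dz, dgamma, dbeta
```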
    6. Inference:
      1. Compute average over all training minibatches
        1. \mu_{BN} = \frac{1}{N_{batches}} \sum_{batch}{\mu_B(batch)}.
        2. \sigma_{BN}^2 = \frac{B}{(B-1)N_{batches}} \sum_{batch}{\sigma_B^2(batch)}.
      2. Apply these fixed statistics to the inference instances (see the sketch below).
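A minimal sketch of the inference-time computation, assuming batch_means and batch_vars are lists of the per-mini-batch statistics collected during training (all names are illustrative):

```python
import numpy as np

def batchnorm_inference(z, gamma, beta, batch_means, batch_vars, B, eps=1e-5):
    """Apply fixed, training-set-wide statistics to inference instances z."""
    mu_BN = np.mean(batch_means, axis=0)                  # average of the batch means
    var_BN = (B / (B - 1)) * np.mean(batch_vars, axis=0)  # bias-corrected average of the batch variances
    u = (z - mu_BN) / np.sqrt(var_BN + eps)
    return gamma * u + beta
```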
  3. Over-fitting
    1. Unconstrained networks tend to learn large weights, which produce sharp changes in the output.
      1. Constraining the weights to stay small forces the perceptrons to respond more gradually and yields a smoother output response.
      2. L(W_1,W_2,...,W_K) = \frac{1}{T}\sum_t{Div(Y_t,d_t)} + \frac{\lambda}{2}\sum_k{||W_k||^2} (see the weight-decay sketch below).
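Since the gradient of the penalty \frac{\lambda}{2}||W||^2 is simply \lambda W, a minimal sketch of the resulting "weight decay" update (lr and lam are illustrative hyperparameter names):

```python
def l2_regularized_step(W, grad_div, lr=0.01, lam=1e-4):
    """One gradient step on Div + (lambda/2)||W||^2: the penalty adds lam * W to the gradient."""
    return W - lr * (grad_div + lam * W)
```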
    2. MLPs naturally impose constraints: each layer operates on the output of the previous layer, which is already smoothed, so deeper networks produce smoother outputs.
  4. Bagging
    1. Resample the training data (bootstrap samples drawn with replacement), train several different classifiers, and combine their predictions by averaging or voting (see the sketch below).
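A minimal sketch, assuming train_model(X, Y) is some user-supplied training routine that returns a callable classifier (both the helper and its signature are hypothetical):

```python
import numpy as np

def bagging(X, Y, train_model, n_models=10):
    """Train n_models classifiers on bootstrap resamples and return an averaging ensemble."""
    models = []
    for _ in range(n_models):
        idx = np.random.randint(0, len(X), size=len(X))       # sample with replacement
        models.append(train_model(X[idx], Y[idx]))
    return lambda x: np.mean([m(x) for m in models], axis=0)  # average (or vote) at test time
```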
  5. Dropout
    1. Training: For each input, at each iteration, “turn off” each neuron with a probability of 1 - \alpha.
      1. Each input is going to see a different network at each pass. For N neurons, there are 2^N possible sub-networks.
      2. Without dropout, a non-compressive layer may just clone its input to its output.
    2. Formula:
      1. Each neuron has the activation y_i^{(k)} = D \cdot f(\sum_j{w_{ji}^{(k)} y_j^{(k-1)}} + b_i^{(k)}), where D is a Bernoulli random variable that takes value 1 with probability \alpha.
      2. The expected output of the neuron is E[y_i^{(k)}] = \alpha f(\sum_j{w_{ji}^{(k)} y_j^{(k-1)}} + b_i^{(k)}).
    3. Testing: simply scale the outputs of the neurons by \alpha. Equivalently (see the sketch below), either:
      1. Testing time: scale the weights by \alpha, OR
      2. Training time: scale the weights by \frac{1}{\alpha} ("inverted dropout").
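A minimal sketch of the inverted-dropout convention: the retained activations are rescaled by 1/\alpha during training, so the test-time network needs no extra scaling. Here alpha is the keep probability and the function names are illustrative:

```python
import numpy as np

def dropout_train(y, alpha=0.8):
    """Training: keep each unit with probability alpha, rescale survivors by 1/alpha."""
    mask = np.random.rand(*y.shape) < alpha   # Bernoulli(alpha) mask, one draw per neuron
    return y * mask / alpha                   # expectation matches the full (undropped) activation

def dropout_test(y):
    """Testing: with inverted dropout the activations are used as-is."""
    return y
```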
    4. Variants:
      1. Zoneout; Dropconnect; Shakeout; Whiteout.
  6. Other heuristics
    1. Early stopping: stop training before the network over-fits (e.g., when validation error stops improving).
    2. Gradient clipping: set a ceiling on the derivative values; a typical ceiling is 5 (see the sketch after this list).
    3. Data augmentation: distort the training examples (rotation, stretching, etc.).
    4. Normalize the input.
    5. Initialization techniques.
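Minimal sketches of the gradient-clipping and input-normalization heuristics above (the ceiling of 5 follows the note in the list; function names are illustrative):

```python
import numpy as np

def clip_gradient(grad, ceiling=5.0):
    """Cap every gradient component at +/- ceiling (element-wise clipping)."""
    return np.clip(grad, -ceiling, ceiling)

def normalize_inputs(X):
    """Standardize the inputs to zero mean and unit variance, per feature."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
```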
  7. Setting up a problem:
    1. Obtain training data
    2. Choose network architecture
    3. Choose divergence function
    4. Choose heuristics
    5. Choose optimization algorithm
    6. Grid search on hyperparameters
    7. Train.
