
  1. Recap
    1. Gradient descent; Backprop.
      1. Minimizes a loss which relates to classification accuracy but is not actually classification accuracy.
      2. Minimizing the loss is hard enough on its own.
        1. Learning rate vs convergence.
        2. High-dimensional: many weights to optimize jointly.
          1. Normalize.
          2. Treat each dimension independently – Rprop / Quickprop
          3. Momentum methods – Momentum / Nesterov (a short update sketch follows this recap)
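
To ground the recap, here is a minimal NumPy sketch of the two momentum-style updates named above. Everything here (`grad_fn`, `eta`, `beta`, the toy objective) is an illustrative assumption, not something fixed by the notes.

```python
import numpy as np

def momentum_step(w, grad_fn, v, eta=0.1, beta=0.9):
    # Classical momentum: v accumulates an exponentially weighted
    # average of past gradients, and w moves along the velocity.
    v = beta * v - eta * grad_fn(w)
    return w + v, v

def nesterov_step(w, grad_fn, v, eta=0.1, beta=0.9):
    # Nesterov: evaluate the gradient at the look-ahead point w + beta*v.
    v = beta * v - eta * grad_fn(w + beta * v)
    return w + v, v

# Toy usage: minimize f(w) = ||w||^2, whose gradient is 2w.
w, v = np.ones(3), np.zeros(3)
for _ in range(50):
    w, v = nesterov_step(w, lambda u: 2.0 * u, v)
```
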
  2. Incremental Update
    1. Gradient Descent (batch)
      1. Iterate through all the samples, then update the function once per full pass.
    2. Incremental Update / Stochastic Gradient Descent
      1. Update the function at every single sample, but keep each adjustment small.
      2. Basic: For weights W_1, W_2, ..., W_K.
        1. Initialize all W_k.
        2. For each training instance (X_t, d_t), compute \nabla_{W_k} Div(Y_t, d_t) for every k.
        3. Update W_k = W_k - \eta \nabla_{W_k} Div(Y_t, d_t)^T.
        4. Iterate until the loss converges.
      3. Caveats (a sketch folding these into the update loop follows this list)
        1. Order of presentation matters: to avoid cyclic behavior, randomly permute the samples at each epoch.
        2. If the training samples are similar to one another, the benefit of incremental updates increases; if they are diverse, the benefit decreases.
        3. Correcting the function for individual instances leads to never-ending updates, so the learning rate must shrink over time.
          1. The fastest converging schedule is \eta_k \propto \frac{1}{k}, which satisfies the convergence conditions \sum_k \eta_k = \infty and \sum_k \eta_k^2 < \infty.
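
A minimal sketch of the basic per-sample loop with the caveats folded in: random re-permutation each epoch and an \eta_k = \eta_0 / k schedule. `grad_div` (the gradient of Div(Y_t, d_t) with respect to W) and the constants are assumed placeholders.

```python
import numpy as np

def sgd(W, samples, grad_div, eta0=0.1, epochs=10):
    # samples: list of (X_t, d_t) training pairs.
    k = 1
    for _ in range(epochs):
        # Randomly permute the order each epoch to avoid cyclic behavior.
        for i in np.random.permutation(len(samples)):
            X_t, d_t = samples[i]
            # Shrinking step size eta_k = eta0 / k prevents never-ending
            # corrections to individual instances.
            W = W - (eta0 / k) * grad_div(W, X_t, d_t)
            k += 1
    return W
```
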
    3. Comparison
      1. Convergence
        1. SGD
          1. Defined as the number of iterations needed to get within \epsilon of the optimal solution: |f(W^{(k)}) - f(W^*)| < \epsilon.
          2. For strongly convex functions, the optimal learning rate is \eta_k \propto \frac{1}{k}, giving convergence in O(\frac{1}{\epsilon}) iterations.
          3. For generally convex functions, a learning rate of \frac{1}{\sqrt{k}} gives an error that decays as \frac{1}{\sqrt{k}}, i.e. O(\frac{1}{\epsilon^2}) iterations.
        2. Batch Gradient Descent
          1. In comparison, batch gradient descent converges to within \epsilon in O(\log(\frac{1}{\epsilon})) iterations, though each iteration requires a full pass over the data.
      2. Expectation, Variance
        1. Recap: the empirical risk is an unbiased estimate of the expected loss.
        2. SGD: E[Loss(f(X;W), g(X))] = E[div(f(X;W), g(X))] (unbiased), but var(Loss) = var(div): the full per-sample variance.
        3. Batch: E[Loss(f(X;W), g(X))] = E[div(f(X;W), g(X))], with var(Loss) = \frac{1}{N} var(div); averaging over N samples shrinks the variance.
      3. Overall: SGD gives fast convergence per computation but high variance -> mini-batch updates as the compromise.
      4. Mini-batch
        1. Convergence: O(\frac{1}{\sqrt{bk}} + \frac{1}{k}) for batch size b. Vector processing (SIMD/GPUs) makes a mini-batch update nearly as cheap as a single-sample update (a sketch follows this section).
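
A sketch of the mini-batch variant, assuming the same hypothetical `grad_div` as above; the batch size b interpolates between SGD (b = 1) and batch descent (b = N).

```python
import numpy as np

def minibatch_sgd(W, X, D, grad_div, b=32, eta=0.01, epochs=10):
    N = len(X)
    for _ in range(epochs):
        order = np.random.permutation(N)
        for start in range(0, N, b):
            idx = order[start:start + b]
            # Averaging b per-sample gradients cuts the variance by ~1/b,
            # and vectorized hardware keeps the batch cheap to compute.
            g = np.mean([grad_div(W, X[i], D[i]) for i in idx], axis=0)
            W = W - eta * g
    return W
```
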
  3. Trend Methods
    1. Background: Convergence depends on learning rate
      1. Simple technique: fix the learning rate until the error plateaus, then reduce it by a fixed factor (a plateau-decay sketch appears below).
      2. Advanced methods: adaptive updates, where the learning rate is itself determined as part of the estimation.
        1. Momentum, Nesterov's, ADAM, etc.
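
A sketch of the simple plateau-based decay from point 1, under assumed names and thresholds (`factor`, `patience`, and `tol` are all illustrative):

```python
def plateau_decay(eta, errors, factor=0.1, patience=3, tol=1e-4):
    # errors: per-epoch error history, most recent last. Keep eta fixed
    # until the error has failed to improve by more than tol for
    # `patience` consecutive epochs, then cut eta by a fixed factor.
    if len(errors) > patience and min(errors[-patience:]) > errors[-patience - 1] - tol:
        eta *= factor
    return eta
```
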
      3. Variance-normalized step
        1. RMS Prop: scaling by the second moment
          1. Updates are computed per parameter.
          2. The mean squared derivative is a running estimate of the average squared derivative, maintained as E[\partial_W^2 D]_k = \gamma E[\partial_W^2 D]_{k-1} + (1 - \gamma) (\partial_W D_k)^2.
          3. General idea: scale steps down for parameters with large mean squared derivatives, and scale them up where the mean squared derivatives are small.
        2. ADAM: RMS prop with momentum – a running first-moment estimate on top of the second-moment scaling (sketches of both follow).
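
Minimal per-parameter sketches of the two updates, following the standard formulations; the constants are the commonly used defaults, not values from these notes.

```python
import numpy as np

def rmsprop_step(W, grad, msd, eta=0.001, gamma=0.9, eps=1e-8):
    # msd is the running mean squared derivative E[dW^2 D]; dividing by
    # sqrt(msd) shrinks steps where the mean squared derivative is large
    # and enlarges them where it is small.
    msd = gamma * msd + (1.0 - gamma) * grad**2
    W = W - eta * grad / (np.sqrt(msd) + eps)
    return W, msd

def adam_step(W, grad, m, v, k, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # RMS prop's second-moment scaling plus a momentum-like first-moment
    # estimate; both are bias-corrected for the first few steps (k >= 1).
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad**2
    m_hat = m / (1.0 - beta1**k)
    v_hat = v / (1.0 - beta2**k)
    W = W - eta * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v
```
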
  4. Divergence
    1. Convergence depends on the divergence function chosen.
      1. L2 divergence – long favored. However, for classification problems the L2 divergence applied to a softmax output is not convex in the network's output, which makes optimization harder.
      2. KL divergence – better when the intent is classification (a small sketch of both follows).
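
To make the contrast concrete, a small sketch of both divergences applied to a softmax output z with a one-hot target d; for a one-hot target the KL divergence reduces to the cross-entropy.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def l2_div(z, d):
    # L2 divergence between the softmax output and one-hot target d;
    # not convex in z, which is the drawback noted above.
    return 0.5 * np.sum((softmax(z) - d) ** 2)

def kl_div(z, d, eps=1e-12):
    # KL divergence from d to softmax(z); convex in z, hence the better
    # choice when the intent is classification.
    return -np.sum(d * np.log(softmax(z) + eps))
```
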
