- Recap
- Gradient descent; Backprop.
- Minimizes a loss that relates to classification accuracy but is not actually classification accuracy.
- Even just minimizing the loss is hard enough.
- Learning rate vs convergence.
- High dimensional: many weights must be optimized jointly.
- Normalize.
- Treat each dimension independently – Rprop / Quickprop.
- Momentum methods – Momentum / Nesterov.
- Incremental Update
- Gradient Descent
- Iterate through all the samples, then update the function once per pass.
- Incremental Update / Stochastic Gradient Descent
- Update the function at every sample, but keep each adjustment small.
- Basic: for weights $w$
  - Initialize $w^{(0)}$; set $j = 0$.
  - For each sample $(X_t, d_t)$, compute the gradient $\nabla_w \mathrm{Div}(f(X_t; w^{(j)}), d_t)$.
  - Update $w^{(j+1)} = w^{(j)} - \eta_j \nabla_w \mathrm{Div}(f(X_t; w^{(j)}), d_t)$; $j \leftarrow j + 1$.
- Iterate the updates until the loss converges.
- Caveats
- Order of presentation matters: to avoid cyclic behavior, randomly permute the samples at each epoch.
- If the samples are similar to one another, the benefit over batch updates grows; if they are very diverse, it shrinks.
- Correcting the function for individual instances leads to never-ending updates; the learning rate must shrink over time to prevent this.
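The per-sample loop above, with shuffling each epoch and a shrinking learning rate, can be sketched as follows (the 1-D least-squares setup and all names here are illustrative, not from the notes):

```python
import random

def sgd(samples, w0, lr0, epochs):
    """Per-sample SGD on a hypothetical 1-D least-squares problem:
    minimize sum_i (w * x_i - y_i)^2."""
    w = w0
    for epoch in range(epochs):
        random.shuffle(samples)          # random permutation each epoch avoids cyclic behavior
        lr = lr0 / (epoch + 1)           # shrinking learning rate prevents never-ending updates
        for x, y in samples:
            grad = 2 * (w * x - y) * x   # gradient of the single-sample divergence
            w -= lr * grad               # update after every sample
    return w

# Fit y = 3x from noise-free samples; w should approach 3.
data = [(x, 3.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]
w = sgd(data, w0=0.0, lr0=0.05, epochs=200)
```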
- Fastest converging series: the learning rates must satisfy $\sum_j \eta_j = \infty$ and $\sum_j \eta_j^2 < \infty$; $\eta_j \propto 1/j$ is the fastest-converging series satisfying both.
- Comparison
- Convergence
- SGD
- Define the convergence rate in terms of the number of iterations taken to get within $\epsilon$ of the optimal solution, i.e. $|f(w^{(k)}) - f(w^*)| < \epsilon$.
- For strongly convex functions, with the optimal learning rate $\eta_k \propto 1/k$, SGD takes $\mathcal{O}(1/\epsilon)$ iterations.
- For generally convex functions, SGD gives an $\mathcal{O}(1/\sqrt{k})$ convergence of $E[f(w^{(k)})] - f(w^*)$ using a learning rate of $\eta_k \propto 1/\sqrt{k}$, i.e. $\mathcal{O}(1/\epsilon^2)$ iterations.
- Batch Gradient Descent
- In comparison, batch convergence takes $\mathcal{O}(\log(1/\epsilon))$ iterations in the strongly convex case, but each iteration touches all $N$ samples.
- SGD
- Expectation, Variance
- Recap: the empirical risk is an unbiased estimate of the expected loss.
- SGD: $E[\nabla_w \mathrm{Div}(f(X_t; w), d_t)] = \nabla_w Loss(w)$, but the variance is that of a single-sample estimate, which is high.
- Batch: the expectation is the same, $\nabla_w Loss(w)$, while averaging over all $N$ samples reduces the variance by a factor of $1/N$.
- Overall: fast convergence per computation but high variance → mini-batches.
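A quick numerical check of the expectation/variance claim, using a hypothetical 1-D least-squares gradient (the setup and names are illustrative): single-sample estimates and 8-sample averages share the same mean, but the averages have roughly 1/8 the variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-sample gradient of (w*x - y)^2 at w = 0 with y = 3x: g = -6*x^2.
# Illustrative choice; any per-sample gradient would show the same effect.
x = rng.normal(size=100_000)
per_sample = -6.0 * x**2                         # single-sample estimates (SGD)
batch8 = per_sample.reshape(-1, 8).mean(axis=1)  # 8-sample averages (mini-batch)

mean_gap = abs(per_sample.mean() - batch8.mean())  # identical expectation
var_ratio = per_sample.var() / batch8.var()        # variance drops ~8x
```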
- Mini-batch
- Convergence: with mini-batch size $b$, the error after $k$ updates is $\mathcal{O}(1/\sqrt{bk} + 1/k)$. Use vector processing over the batch to speed up mini-batch updates.
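A minimal mini-batch SGD sketch (least-squares setup; names, sizes, and rates are illustrative): the gradient is averaged over each batch, and the inner matrix products are exactly the vectorized work that hardware parallelism speeds up.

```python
import numpy as np

def minibatch_sgd(X, y, lr, batch_size, epochs, seed=0):
    """Mini-batch SGD for least squares: averaging the gradient over each
    batch cuts the update variance roughly by 1/batch_size."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))                 # reshuffle every epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)  # vectorized batch gradient
            w -= lr * grad
    return w

# Recover a known weight vector from noise-free targets.
X = np.random.default_rng(1).normal(size=(256, 2))
w = minibatch_sgd(X, X @ np.array([1.0, -2.0]), lr=0.1, batch_size=32, epochs=50)
```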
- Convergence
- Gradient Descent
- Trend method
- Background: Convergence depends on learning rate
- Simple technique: fix learning rate until the error plateaus, then reduce learning rate by a fixed factor.
- Advanced methods: Adaptive updates, where the learning rate is itself determined as part of the estimation.
- Momentum, Nesterov’s, ADAM, etc.
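The simple "fix the rate until the error plateaus, then reduce by a fixed factor" technique can be sketched as a small helper (the function name, thresholds, and factor are illustrative assumptions):

```python
def step_decay(lr, history, factor=0.5, patience=3, tol=1e-4):
    """Reduce the learning rate by a fixed factor once the loss has
    plateaued, i.e. has not improved by more than `tol` over the last
    `patience` epochs."""
    recent = history[-(patience + 1):]
    if len(recent) == patience + 1 and min(recent[1:]) > recent[0] - tol:
        return lr * factor
    return lr

# The loss stalls after the second epoch, so the rate is halved.
lr = step_decay(0.1, [1.0, 0.5, 0.49998, 0.49996, 0.49995])
```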
- Variance-normalized step
- RMS Prop: Scaling by second moment
- Updates are made per parameter.
- The mean squared derivative is a running estimate of the average squared derivative: $E[\partial_w^2 D]_k = \gamma\, E[\partial_w^2 D]_{k-1} + (1-\gamma)\,(\partial_w D)_k^2$, giving the update $w_{k+1} = w_k - \frac{\eta}{\sqrt{E[\partial_w^2 D]_k + \epsilon}}\,(\partial_w D)_k$.
- General idea: scale the step down for parameters with large mean squared derivatives, and up for those with small ones.
- ADAM: RMSprop with momentum – also maintains a running average of the derivative itself, with bias correction of both estimates.
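A sketch of one RMSprop update (hyperparameter values are typical defaults, not prescribed by the notes); note how two parameters with very different gradient magnitudes end up taking similar-sized steps:

```python
import numpy as np

def rmsprop_step(w, grad, ms, lr=1e-3, gamma=0.9, eps=1e-8):
    """One RMSprop update. `ms` is the per-parameter running mean squared
    derivative: large mean squares scale the step down, small ones up."""
    ms = gamma * ms + (1 - gamma) * grad**2   # running second-moment estimate
    w = w - lr * grad / (np.sqrt(ms) + eps)   # per-parameter scaled step
    return w, ms

# A parameter with a 100x larger gradient still takes a similar-sized step.
w, ms = rmsprop_step(np.zeros(2), np.array([100.0, 1.0]), np.zeros(2))
```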
- Divergence
- Convergence depends on the divergence function chosen.
- L2 divergence – long favored. However, for classification problems the L2 divergence is not convex when composed with the softmax output.
- KL divergence – better when the intent is classification.
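A small numerical sketch contrasting the two divergences on a softmax output (the two-class setup and all names are illustrative): both prefer the correct answer, but KL penalizes a confident mistake much more sharply.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def l2_div(z, target):
    # L2 divergence between the softmax output and a one-hot target
    return np.sum((softmax(z) - target) ** 2)

def kl_div(z, target):
    # KL divergence against a one-hot target = cross-entropy = -log p(correct)
    return -np.log(softmax(z)[np.argmax(target)])

target = np.array([1.0, 0.0])      # class 0 is correct
z_wrong = np.array([-2.0, 2.0])    # confidently wrong logits
z_right = np.array([2.0, -2.0])    # confidently right logits
```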