
  1. Recap
    1. Neural nets can be trained via gradient descent that minimizes a loss function
    2. Backpropagation can be used to derive the derivatives of the loss
    3. Backpropagation is not guaranteed to find a true solution
    4. For large networks, the loss function may have a large number of unpleasant saddle points
  2. Convergence
    1. Question: assuming training is approaching a local minimum, does it always converge? How long does it take?
    2. Initial idea: use the second derivative, reasoning from the quadratic case.
      1. However, for arbitrary functions, the second-derivative test does not guarantee a global maximum/minimum.
      2. It is simply the best we can do, as a local approximation.
    3. Convex Loss Functions
      1. Convex: the surface curves continuously upward, so the chord connecting any two points on it lies above the surface without intersecting it.
      2. Caveat: most neural-network loss functions are not convex.
    4. Convergence of Gradient Descent
      1. Definition
        1. Converging: updates arrive at a fixed point.
        2. Jittering: bouncing around the optimum.
        3. Diverging: swinging progressively away from the optimum.
      2. Convergence rate
        1. R = \frac{|f(x^{(k+1)}) - f(x^*)|}{|f(x^{(k)}) - f(x^*)|}
        2. If R is a constant (or upper bounded by a constant) less than 1, the convergence is linear.
          1. In other words, the error shrinks exponentially fast (see the sketch below).
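A quick sanity check of this (the quadratic E(w) = \frac{1}{2}aw^2 and all constants are my own toy choices, not from the notes): gradient descent on a quadratic shrinks the error by the same factor every step, so R is constant and the convergence is linear.

```python
# Convergence rate R = |E(w_{k+1}) - E(w*)| / |E(w_k) - E(w*)| for
# gradient descent on E(w) = 0.5 * a * w^2, whose minimum is E(0) = 0.
a, eta, w = 4.0, 0.1, 1.0                # curvature, step size, starting point

def E(w):
    return 0.5 * a * w * w

prev_gap = E(w)                          # distance from the optimal value
for k in range(5):
    w -= eta * a * w                     # gradient step: dE/dw = a * w
    print(f"step {k}: R = {E(w) / prev_gap:.4f}")  # constant 0.36 -> linear convergence
    prev_gap = E(w)
```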
    5. Convergence for quadratic surfaces
      1. Minimize E = \frac{1}{2} aw^2 + bw + c.
      2. w^{(k+1)} = w^{(k)} - \eta\frac{dE(w^{(k)})}{dw}.
      3. One step to the global minimum (with the optimal step size)
        1. Taylor series f(x) = f(x_0) + f'(x_0)(x-x_0) + \frac{1}{2}f''(x_0)(x-x_0)^2.
        2. Setting the derivative of the approximation to zero: f'(x_0) + f''(x_0)(x - x_0) = 0.
        3. Comparing with the update x_1 = x_0 - \eta f'(x_0) gives \eta_{opt} = \frac{1}{f''(x_0)} = \frac{1}{a}.
      4. Different cases based on the step size (see the sketch after this list)
        1. \eta < \eta_{opt}: monotonic convergence.
        2. \eta_{opt} < \eta < 2\eta_{opt}: oscillating convergence.
        3. \eta > 2\eta_{opt}: divergence.
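The three regimes are easy to reproduce numerically. A minimal sketch, with the curvature a and the step-size multiples chosen by me for illustration:

```python
# Gradient descent on E(w) = 0.5 * a * w^2 for step sizes in the three
# regimes around eta_opt = 1/a.
a = 2.0
eta_opt = 1.0 / a
cases = [(0.5 * eta_opt, "eta < eta_opt (monotonic)"),
         (1.5 * eta_opt, "eta_opt < eta < 2*eta_opt (oscillating)"),
         (2.5 * eta_opt, "eta > 2*eta_opt (diverging)")]
for eta, label in cases:
    w, trace = 1.0, []
    for _ in range(5):
        w -= eta * a * w                 # dE/dw = a * w
        trace.append(round(w, 3))
    print(f"{label}: {trace}")
```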
    6. Convergence for generic differentiable convex objectives (Scalar)
      1. Approximate the objective with a Taylor series up to second order.
        1. E \approx E(w^k) + (w - w^{(k)})\frac{dE(w^{(k)})}{dw} + \frac{1}{2}(w - w^{(k)})^2 \frac{d^2E(w^{(k)})}{dw^2}.
      2. By the same logic as in the quadratic case (numeric sketch below)
        1. \eta_{opt} = (\frac{d^2E(w^{(k)})}{dw^2})^{-1}.
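A small sketch of this idea on a non-quadratic convex function (the function E(w) = w^2 + e^w and the starting point are my own choices): re-estimating \eta as the inverse second derivative at every iterate converges very quickly.

```python
import math

# Gradient descent with the locally optimal step eta = 1 / E''(w_k)
# on the convex, non-quadratic function E(w) = w^2 + exp(w).
def dE(w):
    return 2 * w + math.exp(w)           # first derivative

def d2E(w):
    return 2 + math.exp(w)               # second derivative, always > 0

w = 2.0
for k in range(6):
    w -= dE(w) / d2E(w)                  # eta_opt re-estimated at every iterate
    print(f"iter {k}: w = {w:.6f}, dE = {dE(w):.2e}")
```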
    7. Convergence for generic differentiable convex objectives (Vector)
      1. Functions
        1. General:
          1. E = g(\textbf{w}), where \textbf{w} = [w_1, ..., w_N].
        2. Simple quadratic convex function (A symmetric):
          1. E = \frac{1}{2}\textbf{w}^TA\textbf{w} + \textbf{w}^T\textbf{b} + c.
          2. Since E is a scalar, E^T = E, so A can always be taken to be symmetric.
        3. Combination of quadratic functions (when A is diagonal):
          1. E = \sum_i \left(\frac{1}{2} a_{ii} w_i^2 + b_i w_i\right) + c.
          2. All a_{ii} values are positive (required for convexity).
          3. A sum of N independent quadratic functions.
        4. Ideal
          1. Since the same step size is used along every direction, it must be smaller than twice the smallest optimal step size over all dimensions: \eta < 2\min_i \eta_{i,opt}.
        5. Reality
          1. Convergence is particularly slow if \frac{\max_i\eta_{i,opt}}{\min_i\eta_{i,opt}} is large.
          2. This ratio is the condition number of A: the larger the condition number, the slower the convergence (numeric sketch below).
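A numeric sketch of this slowdown, assuming a diagonal quadratic with a_{11} = 100 and a_{22} = 1 (condition number 100; all values mine): the shared step size must respect the steepest axis, so the shallow axis crawls.

```python
import numpy as np

# Gradient descent on E(w) = 0.5 * (100 * w1^2 + 1 * w2^2).
# Per-axis optimal steps are 1/100 and 1/1; a single shared eta must
# stay below 2/100 for stability, so the w2 axis moves very slowly.
A = np.diag([100.0, 1.0])
w = np.array([1.0, 1.0])
eta = 1.0 / 100.0                        # optimal for the steep axis only

for _ in range(200):
    w = w - eta * (A @ w)                # gradient of E is A w
print("after 200 steps:", w)             # w1 hit 0 in one step; w2 is still ~0.13
```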
        6. Solution
          1. Normalize / Scale the axes
            1. Transform the coordinates so that every direction has identical curvature.
            2. E = \frac{1}{2}\textbf{w}^T A \textbf{w} + \textbf{b}^T\textbf{w} + c.
            3. Substitute \hat{\textbf{w}} = A^{1/2}\textbf{w}.
            4. E = \frac{1}{2}\hat{\textbf{w}}^T\hat{\textbf{w}} + \hat{\textbf{b}}^T\hat{\textbf{w}} + c, where \hat{\textbf{b}} = A^{-1/2}\textbf{b}.
            5. Back in the original coordinates, the update becomes \textbf{w}^{(k+1)} = \textbf{w}^{(k)} - \eta A^{-1} \nabla_{\textbf{w}}E(\textbf{w}^{(k)})^T, which bakes the optimal step size into the update (Newton's method; sketch below).
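Continuing the sketch above with the same assumed A: scaling the gradient by A^{-1} equalizes the axes, and a single step with \eta = 1 lands exactly on the minimum.

```python
import numpy as np

# Scaled update w <- w - eta * A^{-1} * grad E(w) on the same quadratic:
# dividing out the curvature makes every axis identical, so eta = 1
# reaches the minimum in a single step (Newton's method).
A = np.diag([100.0, 1.0])
w = np.array([1.0, 1.0])
w = w - 1.0 * np.linalg.solve(A, A @ w)  # A^{-1}(A w) = w, so the new w is exactly 0
print("one preconditioned step:", w)     # [0. 0.]
```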
        7. Issues with the optimal step size
          1. The Hessian
            1. For non-convex functions, the Hessian may not be positive semi-definite, and the algorithm can diverge.
          2. The learning rate
            1. Having \eta > 2\eta_{opt} can help escape local minima.
            2. Decaying learning rate
              1. Linear \eta_k = \frac{\eta_0}{k+1}.
              2. Quadratic \eta_k = \frac{\eta_0}{(k+1)^2}.
              3. Exponential \eta_k = \eta_0 e^{-\beta k}.
            3. Conditions for the schedule to guarantee convergence (numeric check below)
              1. \sum_k \eta_k = \infty: guarantees the updates can wander arbitrarily far, so the minimum is reachable from anywhere.
              2. \sum_k \eta_k^2 < \infty: guarantees the step sizes shrink, so the iterates eventually settle.
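A rough numeric check of the two conditions (the quadratic objective and \eta_0 are my choices): linear decay satisfies both and keeps making progress, while quadratic decay has a finite \sum_k \eta_k and stalls short of the minimum.

```python
# Decay schedules on E(w) = 0.5 * w^2 (the gradient is w itself).
eta0 = 0.5
w_linear = w_quad = 10.0
for k in range(10000):
    w_linear -= (eta0 / (k + 1)) * w_linear        # sum of steps diverges
    w_quad   -= (eta0 / (k + 1) ** 2) * w_quad     # sum of steps is finite
print("linear decay:   ", w_linear)      # keeps creeping toward the minimum at 0
print("quadratic decay:", w_quad)        # stalls well short of the minimum
```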
            4. Different step sizes for different parameters?
              1. Not with plain gradient descent: each component's step is tied to its gradient magnitude, which cannot guarantee convergence.
              2. Derivative-inspired algorithms: Rprop, QuickProp.
                1. Rprop
                  1. Adapt each parameter's step independently: if the gradient keeps its sign across updates, take a bigger step; if the sign flips, the minimum was overshot, so take a smaller step (sketch below).
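A stripped-down 1-D sketch of this sign-based rule (the growth/shrink factors and the objective are my own; full Rprop also bounds the step size):

```python
# Rprop-style update in one dimension: the step size adapts to the
# *sign* of the gradient, not its magnitude.
def grad(w):
    return 2 * (w - 3.0)                 # gradient of E(w) = (w - 3)^2

w, step, prev_sign = 0.0, 0.1, None
for _ in range(25):
    g = grad(w)
    sign = 1 if g > 0 else -1
    if prev_sign is not None:
        step *= 1.2 if sign == prev_sign else 0.5  # grow on same sign, shrink on flip
    w -= sign * step                     # move against the gradient by the adaptive step
    prev_sign = sign
print("w ≈", round(w, 3))                # close to the minimum at w = 3
```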
              3. The momentum methods
                1. Maintain a running average of all past steps.
                2. In directions where convergence is smooth, the average accumulates to a large value.
                3. In directions where the estimate swings, the positive and negative swings cancel out in the average.
                4. \Delta W^{(k)} = \beta \Delta W^{(k-1)} - \eta \nabla_W Loss(W^{(k-1)})^T.
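A minimal sketch of this update on the same assumed ill-conditioned quadratic as earlier (\eta and \beta are illustrative values):

```python
import numpy as np

# Momentum: Delta W_k = beta * Delta W_{k-1} - eta * grad Loss(W_{k-1}).
# The running average builds up along smooth directions and cancels
# along oscillating ones.
A = np.diag([100.0, 1.0])                # ill-conditioned quadratic, grad = A w
w = np.array([1.0, 1.0])
dw = np.zeros(2)
eta, beta = 0.01, 0.9

for _ in range(200):
    dw = beta * dw - eta * (A @ w)       # running average of past steps
    w = w + dw
print("with momentum:", w)               # both coordinates near 0 after 200 steps
```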
              4. Nesterov's Accelerated Gradient
                1. \Delta W^{(k)} = \beta \Delta W^{(k-1)} - \eta \nabla_W Loss(W^{(k-1)} + \beta \Delta W^{(k-1)})^T
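And the same sketch with the Nesterov look-ahead, where the gradient is taken at W^{(k-1)} + \beta\Delta W^{(k-1)} (values assumed as above):

```python
import numpy as np

# Nesterov's accelerated gradient: evaluate the gradient at the
# look-ahead point W + beta * Delta W instead of at W itself.
A = np.diag([100.0, 1.0])
w = np.array([1.0, 1.0])
dw = np.zeros(2)
eta, beta = 0.01, 0.9

for _ in range(200):
    lookahead = w + beta * dw            # peek ahead along the running average
    dw = beta * dw - eta * (A @ lookahead)
    w = w + dw
print("with Nesterov:", w)               # converges like momentum, typically a bit faster
```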
