
  1. Recap
    1. Neural nets can be trained via gradient descent that minimizes a loss function
    2. Backpropagation can be used to derive the derivatives of the loss
    3. Backpropagation is not guaranteed to find a true solution
    4. For large networks, the loss function may have a large number of unpleasant saddle points
  2. Convergence
    1. Question: assuming training is approaching a local minimum, does it always converge? How long does it take?
    2. Initial idea: use the second derivative, reasoning from the quadratic case.
      1. However, for arbitrary functions, the second-derivative test does not guarantee a global maximum/minimum.
      2. It is simply the best we can do, as a local approximation.
    3. Convex Loss Functions
      1. Convex: the surface curves continuously upward, so the chord connecting any two points on it lies above the surface without intersecting it.
      2. Caveat: most neural-network loss functions are not convex.
    4. Convergence of Gradient Descent
      1. Definition
        1. Converging: updates arrive at a fixed point.
        2. Jittering: bouncing around the optimum.
        3. Diverging: swinging progressively away from the optimum.
      2. Convergence rate
        1. R = \frac{|f(x^{(k+1)}) - f(x^*)|}{|f(x^{(k)}) - f(x^*)|}
        2. If R is a constant (or upper bounded by a constant) less than 1, the convergence is linear.
          1. In other words, the error shrinks exponentially fast (see the sketch below).
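A quick sanity check of this (the quadratic E(w) = \frac{1}{2}aw^2 and all constants are my own toy choices, not from the notes): gradient descent on a quadratic shrinks the error by the same factor every step, so R is constant and the convergence is linear.

```python
# Convergence rate R = |E(w_{k+1}) - E(w*)| / |E(w_k) - E(w*)| for
# gradient descent on E(w) = 0.5 * a * w^2, whose minimum is E(0) = 0.
a, eta, w = 4.0, 0.1, 1.0                # curvature, step size, starting point

def E(w):
    return 0.5 * a * w * w

prev_gap = E(w)                          # distance from the optimal value
for k in range(5):
    w -= eta * a * w                     # gradient step: dE/dw = a * w
    print(f"step {k}: R = {E(w) / prev_gap:.4f}")  # constant 0.36 -> linear convergence
    prev_gap = E(w)
```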
    5. Convergence for quadratic surfaces
      1. Minimize E = \frac{1}{2} aw^2 + bw + c.
      2. w^{(k+1)} = w^{(k)} - \eta\frac{dE(w^{(k)})}{dw}.
      3. One step to the global minimum (with the optimal step size)
        1. Taylor series f(x) = f(x_0) + f'(x_0)(x-x_0) + \frac{1}{2}f''(x_0)(x-x_0)^2.
        2. Setting the derivative of the approximation to zero: f'(x_0) + f''(x_0)(x - x_0) = 0.
        3. Comparing with the update x_1 = x_0 - \eta f'(x_0) gives \eta_{opt} = \frac{1}{f''(x_0)} = \frac{1}{a}.
      4. Different cases based on the step size (see the sketch after this list)
        1. \eta < \eta_{opt}: monotonic convergence.
        2. \eta_{opt} < \eta < 2\eta_{opt}: oscillating convergence.
        3. \eta > 2\eta_{opt}: divergence.
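The three regimes are easy to reproduce numerically. A minimal sketch, with the curvature a and the step-size multiples chosen by me for illustration:

```python
# Gradient descent on E(w) = 0.5 * a * w^2 for step sizes in the three
# regimes around eta_opt = 1/a.
a = 2.0
eta_opt = 1.0 / a
cases = [(0.5 * eta_opt, "eta < eta_opt (monotonic)"),
         (1.5 * eta_opt, "eta_opt < eta < 2*eta_opt (oscillating)"),
         (2.5 * eta_opt, "eta > 2*eta_opt (diverging)")]
for eta, label in cases:
    w, trace = 1.0, []
    for _ in range(5):
        w -= eta * a * w                 # dE/dw = a * w
        trace.append(round(w, 3))
    print(f"{label}: {trace}")
```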
    6. Convergence for generic differentiable convex objectives (Scalar)
      1. Approximate the objective with a Taylor series up to second order.
        1. E \approx E(w^k) + (w - w^{(k)})\frac{dE(w^{(k)})}{dw} + \frac{1}{2}(w - w^{(k)})^2 \frac{d^2E(w^{(k)})}{dw^2}.
      2. By the same logic as in the quadratic case (numeric sketch below)
        1. \eta_{opt} = (\frac{d^2E(w^{(k)})}{dw^2})^{-1}.
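A small sketch of this idea on a non-quadratic convex function (the function E(w) = w^2 + e^w and the starting point are my own choices): re-estimating \eta as the inverse second derivative at every iterate converges very quickly.

```python
import math

# Gradient descent with the locally optimal step eta = 1 / E''(w_k)
# on the convex, non-quadratic function E(w) = w^2 + exp(w).
def dE(w):
    return 2 * w + math.exp(w)           # first derivative

def d2E(w):
    return 2 + math.exp(w)               # second derivative, always > 0

w = 2.0
for k in range(6):
    w -= dE(w) / d2E(w)                  # eta_opt re-estimated at every iterate
    print(f"iter {k}: w = {w:.6f}, dE = {dE(w):.2e}")
```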
    7. Convergence for generic differentiable convex objectives (Vector)
      1. Functions
        1. General:
          1. E = g(\textbf{w}), where \textbf{w} = [w_1, ..., w_N].
        2. Simple quadratic convex function (A symmetric):
          1. E = \frac{1}{2}\textbf{w}^TA\textbf{w} + \textbf{w}^T\textbf{b} + c.
          2. Since E is a scalar, E^T = E, so A can always be taken to be symmetric.
        3. Combination of quadratic functions (when A is diagonal):
          1. E = \sum_i \left(\frac{1}{2} a_{ii} w_i^2 + b_i w_i\right) + c.
          2. All a_{ii} values are positive (required for convexity).
          3. A sum of N independent quadratic functions.
        4. Ideal
          1. Since the same step size is used along every direction, it must be smaller than twice the smallest optimal step size over all dimensions: \eta < 2\min_i \eta_{i,opt}.
        5. Reality
          1. Convergence is particularly slow if \frac{\max_i\eta_{i,opt}}{\min_i\eta_{i,opt}} is large.
          2. This ratio is the condition number of A: the larger the condition number, the slower the convergence (numeric sketch below).
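A numeric sketch of this slowdown, assuming a diagonal quadratic with a_{11} = 100 and a_{22} = 1 (condition number 100; all values mine): the shared step size must respect the steepest axis, so the shallow axis crawls.

```python
import numpy as np

# Gradient descent on E(w) = 0.5 * (100 * w1^2 + 1 * w2^2).
# Per-axis optimal steps are 1/100 and 1/1; a single shared eta must
# stay below 2/100 for stability, so the w2 axis moves very slowly.
A = np.diag([100.0, 1.0])
w = np.array([1.0, 1.0])
eta = 1.0 / 100.0                        # optimal for the steep axis only

for _ in range(200):
    w = w - eta * (A @ w)                # gradient of E is A w
print("after 200 steps:", w)             # w1 hit 0 in one step; w2 is still ~0.13
```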
        6. Solution
          1. Normalize / Scale the axes
            1. Transform the coordinates so that every direction has identical curvature.
            2. E = \frac{1}{2}\textbf{w}^T A \textbf{w} + \textbf{b}^T\textbf{w} + c.
            3. Substitute \hat{\textbf{w}} = A^{1/2}\textbf{w}.
            4. E = \frac{1}{2}\hat{\textbf{w}}^T\hat{\textbf{w}} + \hat{\textbf{b}}^T\hat{\textbf{w}} + c, where \hat{\textbf{b}} = A^{-1/2}\textbf{b}.
            5. Back in the original coordinates, the update becomes \textbf{w}^{(k+1)} = \textbf{w}^{(k)} - \eta A^{-1} \nabla_{\textbf{w}}E(\textbf{w}^{(k)})^T, which bakes the optimal step size into the update (Newton's method; sketch below).
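Continuing the sketch above with the same assumed A: scaling the gradient by A^{-1} equalizes the axes, and a single step with \eta = 1 lands exactly on the minimum.

```python
import numpy as np

# Scaled update w <- w - eta * A^{-1} * grad E(w) on the same quadratic:
# dividing out the curvature makes every axis identical, so eta = 1
# reaches the minimum in a single step (Newton's method).
A = np.diag([100.0, 1.0])
w = np.array([1.0, 1.0])
w = w - 1.0 * np.linalg.solve(A, A @ w)  # A^{-1}(A w) = w, so the new w is exactly 0
print("one preconditioned step:", w)     # [0. 0.]
```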
        7. Issues with the optimal step size
          1. The Hessian
            1. For non-convex functions, the Hessian may not be positive semi-definite, and the algorithm can diverge.
          2. The learning rate
            1. Having \eta > 2\eta_{opt} can help escape local minima.
            2. Decaying learning rate
              1. Linear \eta_k = \frac{\eta_0}{k+1}.
              2. Quadratic \eta_k = \frac{\eta_0}{(k+1)^2}.
              3. Exponential \eta_k = \eta_0 e^{-\beta k}.
            3. Conditions for the schedule to guarantee convergence (numeric check below)
              1. \sum_k \eta_k = \infty: guarantees the updates can wander arbitrarily far, so the minimum is reachable from anywhere.
              2. \sum_k \eta_k^2 < \infty: guarantees the step sizes shrink, so the iterates eventually settle.
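A rough numeric check of the two conditions (the quadratic objective and \eta_0 are my choices): linear decay satisfies both and keeps making progress, while quadratic decay has a finite \sum_k \eta_k and stalls short of the minimum.

```python
# Decay schedules on E(w) = 0.5 * w^2 (the gradient is w itself).
eta0 = 0.5
w_linear = w_quad = 10.0
for k in range(10000):
    w_linear -= (eta0 / (k + 1)) * w_linear        # sum of steps diverges
    w_quad   -= (eta0 / (k + 1) ** 2) * w_quad     # sum of steps is finite
print("linear decay:   ", w_linear)      # keeps creeping toward the minimum at 0
print("quadratic decay:", w_quad)        # stalls well short of the minimum
```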
            4. Different step sizes for different parameters?
              1. Not with plain gradient descent: each component's step is tied to its gradient magnitude, which cannot guarantee convergence.
              2. Derivative-inspired algorithms: Rprop, QuickProp.
                1. Rprop
                  1. Adapt each parameter's step independently: if the gradient keeps its sign across updates, take a bigger step; if the sign flips, the minimum was overshot, so take a smaller step (sketch below).
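A stripped-down 1-D sketch of this sign-based rule (the growth/shrink factors and the objective are my own; full Rprop also bounds the step size):

```python
# Rprop-style update in one dimension: the step size adapts to the
# *sign* of the gradient, not its magnitude.
def grad(w):
    return 2 * (w - 3.0)                 # gradient of E(w) = (w - 3)^2

w, step, prev_sign = 0.0, 0.1, None
for _ in range(25):
    g = grad(w)
    sign = 1 if g > 0 else -1
    if prev_sign is not None:
        step *= 1.2 if sign == prev_sign else 0.5  # grow on same sign, shrink on flip
    w -= sign * step                     # move against the gradient by the adaptive step
    prev_sign = sign
print("w ≈", round(w, 3))                # close to the minimum at w = 3
```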
              3. The momentum methods
                1. Maintain a running average of all past steps.
                2. In directions where convergence is smooth, the average accumulates to a large value.
                3. In directions where the estimate swings, the positive and negative swings cancel out in the average.
                4. \Delta W^{(k)} = \beta \Delta W^{(k-1)} - \eta \nabla_W Loss(W^{(k-1)})^T.
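A minimal sketch of this update on the same assumed ill-conditioned quadratic as earlier (\eta and \beta are illustrative values):

```python
import numpy as np

# Momentum: Delta W_k = beta * Delta W_{k-1} - eta * grad Loss(W_{k-1}).
# The running average builds up along smooth directions and cancels
# along oscillating ones.
A = np.diag([100.0, 1.0])                # ill-conditioned quadratic, grad = A w
w = np.array([1.0, 1.0])
dw = np.zeros(2)
eta, beta = 0.01, 0.9

for _ in range(200):
    dw = beta * dw - eta * (A @ w)       # running average of past steps
    w = w + dw
print("with momentum:", w)               # both coordinates near 0 after 200 steps
```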
              4. Nesterov's Accelerated Gradient
                1. \Delta W^{(k)} = \beta \Delta W^{(k-1)} - \eta \nabla_W Loss(W^{(k-1)} + \beta \Delta W^{(k-1)})^T
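And the same sketch with the Nesterov look-ahead, where the gradient is taken at W^{(k-1)} + \beta\Delta W^{(k-1)} (values assumed as above):

```python
import numpy as np

# Nesterov's accelerated gradient: evaluate the gradient at the
# look-ahead point W + beta * Delta W instead of at W itself.
A = np.diag([100.0, 1.0])
w = np.array([1.0, 1.0])
dw = np.zeros(2)
eta, beta = 0.01, 0.9

for _ in range(200):
    lookahead = w + beta * dw            # peek ahead along the running average
    dw = beta * dw - eta * (A @ lookahead)
    w = w + dw
print("with Nesterov:", w)               # converges like momentum, typically a bit faster
```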
