- Recap
- Neural nets can be trained via gradient descent that minimizes a loss function
- Backpropagation can be used to compute the derivatives of the loss
- Backpropagation is not guaranteed to find a true solution
- For large networks, the loss function may have a large number of unpleasant saddle points
- Convergence
- Question: Assume training arrives at a local minimum. Does it always converge? How long does it take?
- Initial idea: use the second derivative. The basic intuition comes from quadratic functions, where curvature tells us exactly how far to step.
- However, for general functions, second-order information does not guarantee finding the global maximum/minimum.
- It is simply the best we can do, as a local approximation.
- Convex Loss Functions
- Convex: the surface curves continuously upward, so the segment connecting any two points on the surface lies above it without intersecting it.
- Caveat: most loss functions of interest are not convex.
- Convergence of Gradient Descent
- Definition
- Converging: the updates arrive at a fixed point (the optimum).
- Jittering: the updates bounce around the optimum without settling.
- Diverging: the updates swing progressively away from the optimum.
- Convergence rate
- R = |f(w^(k+1)) - f(w*)| / |f(w^(k)) - f(w*)|, the ratio of successive errors.
- If R is constant (or upper bounded by a constant less than 1), the convergence is linear.
- In other words, the error shrinks exponentially fast in the number of steps.
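As a numeric sketch (the quadratic, a, and η are illustrative choices, not from the notes): on f(w) = (a/2)w² the ratio of successive errors is the constant (1 - ηa)², which is exactly linear convergence.

```python
# Linear convergence sketch: on f(w) = (a/2) w^2 with minimum f* = 0,
# each gradient step multiplies w by (1 - eta*a), so the error ratio
# f(w_{k+1}) / f(w_k) is the constant (1 - eta*a)^2.
a, eta = 4.0, 0.1          # illustrative curvature and step size
f = lambda w: 0.5 * a * w * w
w = 1.0
errors = []
for _ in range(5):
    w = w - eta * a * w    # f'(w) = a * w
    errors.append(f(w))
ratios = [errors[k + 1] / errors[k] for k in range(len(errors) - 1)]
print(ratios)              # every ratio equals (1 - 0.4)^2 = 0.36
```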
- Convergence for quadratic surfaces
- Minimize E(w) = (1/2) a w^2 + b w + c
- Gradient descent update: w^(k+1) = w^(k) - η E'(w^(k)), with E'(w) = a w + b
- With step size η = 1/a, a single step lands exactly on the global minimum w* = -b/a: one step to the global minimum.
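A minimal numeric check of the one-step claim (the coefficients and starting point are made up for illustration):

```python
# One step to the minimum of E(w) = (a/2) w^2 + b w + c when eta = 1/a.
a, b = 3.0, -6.0                 # illustrative coefficients (c does not matter)
dE = lambda w: a * w + b         # E'(w)
w = 10.0                         # arbitrary starting point
w = w - (1.0 / a) * dE(w)        # single gradient step with eta = 1/E'' = 1/a
print(w)                         # lands on w* = -b/a = 2.0, where E'(w*) = 0
```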
- Taylor series: E(w) ≈ E(w^(k)) + E'(w^(k))(w - w^(k)) + (1/2) E''(w^(k))(w - w^(k))^2
- First-order derivative: E'(w) ≈ E'(w^(k)) + E''(w^(k))(w - w^(k))
- Setting E'(w) = 0, we have w_min = w^(k) - E'(w^(k)) / E''(w^(k))
- So the optimal step size is η_opt = 1 / E''(w^(k)); for the quadratic above, η_opt = 1/a.
- Different cases based on the step size η relative to η_opt
- η < η_opt: converge monotonically.
- η_opt < η < 2 η_opt: oscillating convergence.
- η > 2 η_opt: divergence.
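The three regimes can be checked numerically on a toy quadratic (the curvature a and the multiples of η_opt are illustrative choices):

```python
# Step-size regimes for gradient descent on E(w) = (a/2) w^2,
# where eta_opt = 1/a and each step multiplies w by (1 - eta*a).
a = 2.0
eta_opt = 1.0 / a

def run(eta, w=1.0, steps=20):
    for _ in range(steps):
        w = w - eta * a * w
    return w

mono = run(0.5 * eta_opt)   # eta < eta_opt: shrinks monotonically toward 0
osc  = run(1.5 * eta_opt)   # eta_opt < eta < 2*eta_opt: oscillates but converges
div  = run(2.5 * eta_opt)   # eta > 2*eta_opt: magnitude grows without bound
print(mono, osc, div)
```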
- Convergence for generic differentiable convex objectives (Scalar)
- Approximate E(w) with a Taylor series to the second order around the current point:
- E(w) ≈ E(w^(k)) + E'(w^(k))(w - w^(k)) + (1/2) E''(w^(k))(w - w^(k))^2
- By the same logic as the quadratic case, the optimal step size is η_opt = 1 / E''(w^(k)), recomputed at every step.
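A sketch of this rule on a generic convex function (E(w) = e^w + e^{-w} is my choice for illustration, not from the notes):

```python
import math

# Second-order step size eta_opt = 1/E''(w), recomputed at every step,
# on the convex function E(w) = exp(w) + exp(-w) with minimum at w = 0.
dE  = lambda w: math.exp(w) - math.exp(-w)
d2E = lambda w: math.exp(w) + math.exp(-w)

w = 2.0
for _ in range(10):
    w = w - dE(w) / d2E(w)   # step with eta = 1/E''(w)
print(w)                     # very close to the minimum at 0
```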
- Convergence for generic differentiable convex objectives (Vector)
- Functions
- General: E(w), with gradient ∇E(w) and Hessian H(w).
- Update: w^(k+1) = w^(k) - η ∇E(w^(k))
- Simple quadratic convex function (with A a symmetric matrix):
- E(w) = (1/2) w^T A w + w^T b + c
- Since w^T A w = (1/2) w^T (A + A^T) w, A can always be taken to be symmetric.
- Combination of quadratic functions (when A is diagonal):
- E(w) = Σ_i ((1/2) a_ii w_i^2 + b_i w_i) + c
- All a_ii values are positive (for a convex function).
- A sum of independent quadratic functions, one per dimension.
- Ideal
- Each dimension has its own optimal step size η_i = 1/a_ii. If we use the same step size in every direction, it must be smaller than twice the smallest of these optimal step sizes: η < 2 min_i η_i.
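A numeric sketch of the shared-step-size bound (the per-axis curvatures are made up):

```python
# With one step size for all axes of E(w) = sum_i (a_i/2) w_i^2, the
# per-axis optimal steps are 1/a_i; the shared eta must stay below
# 2/max(a_i), i.e. twice the smallest per-axis optimum, or the
# steepest axis diverges.
a = [1.0, 4.0, 25.0]        # illustrative per-axis curvatures
eta_lim = 2.0 / max(a)      # divergence threshold, 0.08 here

def run(eta, steps=100):
    w = [1.0] * len(a)
    for _ in range(steps):
        w = [wi - eta * ai * wi for wi, ai in zip(w, a)]
    return w

good = run(0.9 * eta_lim)   # every axis shrinks
bad  = run(1.1 * eta_lim)   # the a = 25 axis blows up
print(good, bad)
```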
- Reality
- Convergence is particularly slow if the ratio max_i η_i / min_i η_i is large.
- This ratio is governed by the condition number of the Hessian (the ratio of its largest to smallest eigenvalue): convergence is fast when the condition number is close to 1 and slow when it is large.
- Solution
- Normalize / scale the axes so that all directions have identical curvature.
- Substitute ŵ_i = sqrt(a_ii) w_i, giving E = Σ_i ((1/2) ŵ_i^2 + (b_i / sqrt(a_ii)) ŵ_i) + c
- Every axis now has curvature 1, so a single step size is optimal along all of them.
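The rescaling can be sanity-checked on the diagonal case (curvatures and start point are made up, with b = 0 for simplicity so the minimum is at the origin):

```python
import math

# Axis normalization for E(w) = sum_i (a_i/2) w_i^2: substituting
# v_i = sqrt(a_i) * w_i gives E = sum_i (1/2) v_i^2, unit curvature on
# every axis, so one step with eta = 1 lands exactly on the minimum.
a = [1.0, 4.0, 25.0]
w = [3.0, -2.0, 0.5]
v = [math.sqrt(ai) * wi for ai, wi in zip(a, w)]  # rescaled coordinates
v = [vi - 1.0 * vi for vi in v]                   # gradient of (1/2) v_i^2 is v_i
w = [vi / math.sqrt(ai) for ai, vi in zip(a, v)]  # map back to w
print(w)   # [0.0, 0.0, 0.0]
```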
- Issues with the optimal step size
- The Hessian
- For non-convex functions, the Hessian may not be positive semi-definite, so the second-order step can point uphill or toward saddle points; the algorithm can diverge.
- The learning rate
- Having η > η_opt can help escape local minima.
- Decaying learning rate
- Linear decay: η_k = η_0 / (k + 1)
- Quadratic decay: η_k = η_0 / (k + 1)^2
- Exponential decay: η_k = η_0 exp(-βk), with β > 0
- Limitation: the schedule must satisfy two conditions.
- Σ_k η_k = ∞, which guarantees the updates can wander everywhere (reach any point).
- Σ_k η_k^2 < ∞, which guarantees the steps shrink so the updates settle.
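The three decay families can be written as simple schedule functions (the exact parameterizations here, e.g. η_0/(k+1) for linear decay, are common conventions rather than something the notes pin down):

```python
import math

# Three decaying learning-rate schedules, indexed by step k = 0, 1, 2, ...
eta0, beta = 0.1, 0.05   # illustrative initial rate and decay constant

def linear(k):       return eta0 / (k + 1)        # sum diverges, sum of squares converges
def quadratic(k):    return eta0 / (k + 1) ** 2   # decays fast: the sum itself converges
def exponential(k):  return eta0 * math.exp(-beta * k)

print([round(linear(k), 4) for k in range(4)])    # [0.1, 0.05, 0.0333, 0.025]
```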
- Different step sizes for different parameters?
- Simply scaling the step per parameter does not work on its own: the steps are still tied to the gradient magnitude, so convergence cannot be guaranteed.
- Derivative-inspired algorithms: Rprop, QuickProp.
- Rprop (resilient propagation)
- Recalculate the gradient after each step. If the gradient does not change sign, take a bigger step; if it flips sign, the step overshot, so shrink the step.
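A minimal sketch of the sign rule for a single parameter (the growth/shrink factors 1.2 and 0.5 are conventional Rprop choices, not stated in the notes):

```python
# Rprop-style step-size adaptation driven by gradient signs alone.
def adapt_step(grads, step=0.1, grow=1.2, shrink=0.5):
    """Return the step size after seeing a sequence of gradients."""
    prev = 0
    for g in grads:
        sign = (g > 0) - (g < 0)
        if sign == prev and sign != 0:
            step *= grow     # same direction: take a bigger step
        elif prev != 0 and sign != prev:
            step *= shrink   # sign flipped: we overshot, shrink the step
        prev = sign
    return step

print(adapt_step([1.0, 2.0, 0.5]))   # no sign flip: 0.1 * 1.2 * 1.2 = 0.144
print(adapt_step([1.0, -1.0]))       # flip: 0.1 * 0.5 = 0.05
```

Note the magnitude of the gradient never enters, only its sign; that is what decouples the step size from poorly scaled gradients.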
- The momentum methods
- Maintain a running average of all past steps.
- In directions where convergence is smooth, the average has a large value.
- In directions where the estimate swings, the positive and negative swings cancel out in the average.
- Update: Δw^(k) = β Δw^(k-1) - η ∇E(w^(k-1)), then w^(k) = w^(k-1) + Δw^(k)
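A sketch on a one-dimensional quadratic (a, η, β are illustrative values): at this step size plain gradient descent diverges, while the running average of past steps damps the swings and still converges.

```python
# Momentum on E(w) = (a/2) w^2. Here eta * a = 2.5 > 2, so plain
# gradient descent diverges; the momentum average cancels the swings.
a, eta, beta = 25.0, 0.1, 0.9

w_plain = 1.0
for _ in range(200):
    w_plain = w_plain - eta * a * w_plain   # multiplies w by -1.5 each step

w, dw = 1.0, 0.0
for _ in range(200):
    dw = beta * dw - eta * a * w            # running average of past steps
    w = w + dw
print(abs(w_plain), abs(w))
```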
- Nesterov's Accelerated Gradient
- Evaluate the gradient at the look-ahead point instead of the current point: Δw^(k) = β Δw^(k-1) - η ∇E(w^(k-1) + β Δw^(k-1)), then w^(k) = w^(k-1) + Δw^(k)
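A sketch of the look-ahead variant on a toy quadratic (the constants a, η, β are illustrative; this demonstrates the update rule, not a speed comparison):

```python
# Nesterov's accelerated gradient on E(w) = (a/2) w^2: the gradient is
# taken at the look-ahead point w + beta*dw instead of at w itself.
a, eta, beta = 20.0, 0.06, 0.5
w, dw = 1.0, 0.0
for _ in range(200):
    dw = beta * dw - eta * a * (w + beta * dw)   # look-ahead gradient
    w = w + dw
print(abs(w))   # close to the minimum at 0
```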