
  1. Recap
    1. Neural networks are universal approximators (given the right architecture)
    2. To approximate a target function, we must choose the architecture and learn the weights and biases through training
    3. Neural networks are trained to minimize a total loss on a training set (empirical risk minimization)
    4. We use variants of gradient descent to do so.
    5. The gradient of the loss with respect to the network's weights and biases is computed through backpropagation.
  2. Training Neural Nets by Gradient Descent
    1. Total training error: Loss = \frac{1}{T} \sum_t Div(Y_t, d_t; W_1, ..., W_N).
    2. Initialize all weights.
    3. For every layer compute:
      1. \nabla_{W_k}Loss = \frac{1}{T}\sum_t \nabla_{W_k}Div(Y_t, d_t)
      2. W_k = W_k - \eta (\nabla_{W_k} Loss)^T.
    4. Repeat until Loss has converged (a minimal training-loop sketch follows below).
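
To make the recipe above concrete, here is a minimal sketch of full-batch gradient descent. It uses a plain linear model Y = Wx with a squared-error divergence as the simplest stand-in for a network (my choice, not from the notes); the point is the structure of the loop: average the per-instance gradients, update the weights, repeat until the Loss stops changing. The gradient is kept in the same shape as W here, so no transpose is needed in the update.

```python
import numpy as np

# Full-batch gradient descent on a linear model Y = W x with
# Div(Y, d) = ||Y - d||^2 / 2, for which dDiv/dW = (Y - d) x^T.

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # T = 100 training inputs
W_true = rng.normal(size=(2, 3))
D = X @ W_true.T                        # targets d_t

W = np.zeros((2, 3))                    # initialize all weights
eta = 0.1                               # learning rate
prev_loss = np.inf

for step in range(10_000):
    Y = X @ W.T                                       # outputs for all instances
    loss = np.mean(np.sum((Y - D) ** 2, axis=1)) / 2  # Loss = (1/T) sum_t Div
    grad_W = (Y - D).T @ X / len(X)                   # (1/T) sum_t dDiv/dW
    W = W - eta * grad_W                              # gradient-descent update
    if abs(prev_loss - loss) < 1e-10:                 # "until Loss has converged"
        break
    prev_loss = loss

print(f"stopped after {step} steps, loss = {loss:.3e}")
```
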
  3. Forward Computation
    1. Set y_i^{(0)} = x_i.
    2. Iterate for k = 1, ..., N:
    3. For j = 1 : layer width:
      1. z_j^{(k)} = \sum_i w_{ij}^{(k)}y_i^{(k-1)}.
      2. y_j^{(k)} = f_k(z_j^{(k)}).
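
The forward-computation loops above can be sketched directly in code. Biases are omitted, matching the scalar equations (they can be folded in as an always-on input); the names `forward`, `weights`, and `activations` are mine, not from the notes.

```python
import numpy as np

def forward(x, weights, activations):
    """Per-neuron forward pass.

    weights[k-1] has shape (width of layer k, width of layer k-1);
    activations[k-1] is f_k. Returns all intermediate y^(k) and z^(k),
    which the backward pass will need.
    """
    ys = [np.asarray(x, dtype=float)]           # y^(0) = x
    zs = [None]                                 # no z at the input layer
    for k in range(1, len(weights) + 1):        # k = 1, ..., N
        W, f = weights[k - 1], activations[k - 1]
        z = np.zeros(W.shape[0])
        y = np.zeros(W.shape[0])
        for j in range(W.shape[0]):             # j = 1 : layer width
            z[j] = np.dot(W[j, :], ys[k - 1])   # z_j^(k) = sum_i w_ij^(k) y_i^(k-1)
            y[j] = f(z[j])                      # y_j^(k) = f_k(z_j^(k))
        zs.append(z)
        ys.append(y)
    return ys, zs

# Tiny example: a 2 -> 3 -> 1 network with a tanh hidden layer and a linear output.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
activations = [np.tanh, lambda z: z]
ys, zs = forward([1.0, -2.0], weights, activations)
print(ys[-1])                                   # network output y^(N)
```
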
  4. Gradients: Backward Computation
    1. Initialize Gradient w.r.t network output
      1. \frac{\partial Div}{\partial y_i^{(N)}} = \frac{\partial Div(Y, d)}{\partial y_i^{(N)}}.
      2. \frac{\partial Div}{\partial z_i^{(N)}} = f_N'(z_i^{(N)}) \frac{\partial Div}{\partial y_i^{(N)}}
    2. For k = N-1, N-2, ..., 0; For i = 1: layer width;
      1. \frac{\partial Div}{\partial y_i^{(k)}} = \sum_j w_{ij}^{(k+1)} \frac{\partial Div}{\partial z_j^{(k+1)}}
      2. \frac{\partial Div}{\partial z_i^{(k)}} = f_k'(z_i^{(k)}) \frac{\partial Div}{\partial y_i^{(k)}}
      3. \forall j \frac{\partial Div}{\partial w_{ij}^{(k+1)}} = y_i^{(k)} \frac{\partial Div}{\partial z_{j}^{(k+1)}}.
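
A matching sketch of the backward recursion above, reusing the `ys`, `zs`, `weights`, and `activations` from the forward sketch in the previous section. `act_derivs` holds the element-wise derivatives f_k', and the example divergence Div(Y, d) = (Y - d)^2 / 2 is my choice for illustration; both are assumptions, not from the notes.

```python
import numpy as np

def backward(ys, zs, weights, act_derivs, dDiv_dY):
    """Per-neuron backward pass for scalar activations.

    ys, zs             : intermediate values returned by the forward sketch earlier
    act_derivs[k-1](z) : f_k'(z), applied element-wise
    dDiv_dY            : dDiv/dy^(N), the gradient of the divergence at the output
    Returns a list holding dDiv/dW^(k) for every layer k.
    """
    N = len(weights)
    dy = np.asarray(dDiv_dY, dtype=float)              # dDiv/dy^(N)
    dz = act_derivs[N - 1](zs[N]) * dy                 # dDiv/dz^(N)
    dW = [None] * N
    for k in range(N, 0, -1):
        # dDiv/dw_ij^(k) = y_i^(k-1) dDiv/dz_j^(k), stored with the shape of weights[k-1]
        dW[k - 1] = np.outer(dz, ys[k - 1])
        if k > 1:
            dy = weights[k - 1].T @ dz                 # dDiv/dy_i^(k-1) = sum_j w_ij^(k) dDiv/dz_j^(k)
            dz = act_derivs[k - 2](zs[k - 1]) * dy     # dDiv/dz_i^(k-1) = f_{k-1}'(z_i^(k-1)) dDiv/dy_i^(k-1)
    return dW

# Continuing the 2 -> 3 -> 1 example above, with Div(Y, d) = (Y - d)^2 / 2 so dDiv/dY = Y - d.
act_derivs = [lambda z: 1 - np.tanh(z) ** 2, lambda z: np.ones_like(z)]
d = np.array([0.5])
dW = backward(ys, zs, weights, act_derivs, ys[-1] - d)
print([g.shape for g in dW])                           # [(3, 2), (1, 3)], same shapes as the weights
```
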
  5. Special cases
    1. Assumptions:
      1. A neuron's computation does not directly affect the computation of other neurons in the same or earlier layers.
      2. Neuron outputs are combined through weighted addition.
      3. Activations are differentiable.
    2. All of these assumptions can be violated (details in the slides).
    3. Examples:
      1. Vector activations (violates (1)).
        1. Example: Softmax
          1. y_i^{(k)} = \frac{\exp(z_i^{(k)})}{\sum_j \exp(z_j^{(k)})}
          2. \frac{\partial y_i^{(k)}}{\partial z_j^{(k)}}
            1. = y_i^{(k)}(1 - y_i^{(k)}) if i = j
            2. = -y_i^{(k)} y_j^{(k)} if i \neq j
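
A small sketch of the softmax case, with a finite-difference check of the derivative formula above. Subtracting max(z) before exponentiating is a standard numerical-stability trick, not part of the formula in the notes.

```python
import numpy as np

def softmax(z):
    """y_i = exp(z_i) / sum_j exp(z_j), computed in a numerically stable way."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(y):
    """Full Jacobian dy_i/dz_j = y_i (delta_ij - y_j) = diag(y) - y y^T.

    Every output depends on every input, so the matrix is full rather than
    diagonal, which is exactly how assumption (1) is violated.
    """
    return np.diag(y) - np.outer(y, y)

z = np.array([1.0, 2.0, 0.5])
y = softmax(z)
J = softmax_jacobian(y)

# Finite-difference check of dy/dz_0
eps = 1e-6
e0 = np.array([eps, 0.0, 0.0])
numeric = (softmax(z + e0) - softmax(z - e0)) / (2 * eps)
print(np.allclose(J[:, 0], numeric, atol=1e-6))        # True
```
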
  6. Overall Approach
    1. For each data instance, forward pass and backward pass
    2. The actual loss is the average of the divergence over all training instances: Loss = \frac{1}{|\{X\}|} \sum_X Div(Y(X), d(X)).
    3. The actual gradient is the average of the gradients computed for each training instance: \nabla_W Loss = \frac{1}{|\{X\}|}\sum_X \nabla_W Div(Y(X), d(X))
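
A minimal sketch of the averaging step, assuming a hypothetical `per_instance_grad(x, d)` that runs one forward and backward pass (e.g. the `backward` sketch earlier) and returns a list of per-layer gradient arrays:

```python
import numpy as np

def batch_gradient(per_instance_grad, X, D):
    """Average per-instance gradients into the gradient of the overall Loss."""
    grads = [per_instance_grad(x, d) for x, d in zip(X, D)]   # one pass per training instance
    return [np.mean(layer, axis=0) for layer in zip(*grads)]  # (1/|{X}|) * sum over instances
```
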
  7. Vector formulation
    1. Problem Statement
      1. Assume the width of layer k-1 is D_{k-1} and the width of layer k is D_k.
      2. y_{k-1}, z_{k-1} are D_{k-1}-dimensional column vectors.
      3. y_k, z_k are D_k-dimensional column vectors.
      4. W_k is a D_k \times D_{k-1} matrix.
      5. z_k = W_k y_{k-1} + b_k. y_k = f_k(z_k).
    2. Forward: Y = f_N(W_Nf_{N-1}(...f_1(W_1x+b_1)+...)+b_N).
    3. Backward: Div(Y,d) = Div(f_N(W_Nf_{N-1}(...f_1(W_1x+b_1)...)+b_N), d).
    4. Jacobian Matrix Math Recap:
      1. Definition: The derivative of a vector function w.r.t. a vector input is called a Jacobian.
        1. Jacobian matrix: first-order partial derivatives of a vector-valued function.
        2. Hessian matrix: second-order partial derivatives of a scalar-valued function.
      2. Formula: \Delta y = J_y(x) \Delta x (relating small perturbations of the input and output).
      3. Properties:
        1. Scalar (element-wise) activation: the Jacobian is a diagonal matrix.
        2. Vector activation: the Jacobian is a full matrix.
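
Concretely (this is not spelled out in the notes): for a scalar activation applied element-wise, y_i depends only on z_i, so

J_y(z) = \mathrm{diag}\big(f'(z_1), f'(z_2), \ldots, f'(z_D)\big),

whereas for a vector activation such as softmax every y_i depends on every z_j, which fills in the off-diagonal entries.
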
      4. Special case:
        1. Affine function: z = Wy + b. J_z(y) = W.
        2. Vector derivatives:
          1. For vector functions of vector inputs: y = f(g(x)), z = g(x), y = f(z) => J_y(x) = J_y(z)J_z(x).
          2. For scalar functions of vector inputs: D = f(g(x)), z = g(x) \Rightarrow \nabla_x D = \nabla_z D \, J_z(x).
        3. Scalar functions of affine functions: D = f(Wy + b), z = Wy + b \Rightarrow \nabla_y D = \nabla_z D \, W.
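
A quick numerical sanity check of the last identity, \nabla_y D = \nabla_z D \, W, using an arbitrary scalar function f(z) = \sum_i \tanh(z_i) (my choice, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
b = rng.normal(size=4)
y = rng.normal(size=3)

f = lambda z: np.sum(np.tanh(z))      # scalar function of the affine output z = W y + b
z = W @ y + b
grad_z = 1 - np.tanh(z) ** 2          # dD/dz, kept as a row vector
grad_y = grad_z @ W                   # claimed: dD/dy = dD/dz * J_z(y) = dD/dz * W

# Finite-difference approximation of dD/dy
eps = 1e-6
numeric = np.array([
    (f(W @ (y + eps * e) + b) - f(W @ (y - eps * e) + b)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(grad_y, numeric, atol=1e-6))   # True
```
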
    5. Backward propagation pass:
      1. Apply the chain rule layer by layer (the derivation allows a vector activation, e.g. softmax, at the output layer).
      2. Set y_N = Y, y_0 = x.
      3. Initialize: Compute \nabla_{y_N} Div = \nabla_Y Div.
      4. For layer k = N downto k = 1:
        1. Compute Jacobian matrix J_{y_k}(z_k).
          1. Requires the intermediate values computed in the forward pass.
        2. Recursion:
          1. \nabla_{z_k}Div = \nabla_{y_k} Div \, J_{y_k}(z_k) \quad (\nabla_{z_k} y_k = J_{y_k}(z_k)).
          2. \nabla_{y_{k-1}}Div = \nabla_{z_k}Div \, W_k \quad (\nabla_{y_{k-1}} z_k = W_k)
        3. Gradient computation:
          1. \nabla_{W_k}Div = y_{k-1} \nabla_{z_k}Div \quad (\nabla_{W_k} z_k = y_{k-1})
          2. \nabla_{b_k}Div = \nabla_{z_k}Div \quad (\nabla_{b_k} z_k = 1)
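
Putting the vector formulation together, here is a sketch of the forward and backward passes in matrix form. Gradients with respect to vectors are kept as row vectors, matching the notes, so \nabla_{W_k}Div comes out with shape D_{k-1} \times D_k and the weight update uses its transpose. The function and variable names (`forward`, `backward`, `jac_fs`, ...) are mine.

```python
import numpy as np

def forward(x, Ws, bs, fs):
    """Vector forward pass: z_k = W_k y_{k-1} + b_k, y_k = f_k(z_k)."""
    ys, zs = [np.asarray(x, dtype=float)], [None]
    for W, b, f in zip(Ws, bs, fs):
        zs.append(W @ ys[-1] + b)
        ys.append(f(zs[-1]))
    return ys, zs

def backward(ys, zs, Ws, jac_fs, grad_Y):
    """Vector backward pass following the recursion above.

    jac_fs[k-1](z) must return the Jacobian J_{y_k}(z_k); for a scalar
    activation that is simply diag(f_k'(z_k)).
    """
    N = len(Ws)
    grad_y = np.asarray(grad_Y, dtype=float)          # nabla_{y_N} Div
    grad_Ws, grad_bs = [None] * N, [None] * N
    for k in range(N, 0, -1):
        grad_z = grad_y @ jac_fs[k - 1](zs[k])        # nabla_{z_k} Div = nabla_{y_k} Div J_{y_k}(z_k)
        grad_Ws[k - 1] = np.outer(ys[k - 1], grad_z)  # nabla_{W_k} Div = y_{k-1} nabla_{z_k} Div
        grad_bs[k - 1] = grad_z                       # nabla_{b_k} Div = nabla_{z_k} Div
        grad_y = grad_z @ Ws[k - 1]                   # nabla_{y_{k-1}} Div = nabla_{z_k} Div W_k
    return grad_Ws, grad_bs

# Tiny example: a 3 -> 4 -> 2 network with a tanh hidden layer and a linear output,
# with Div(Y, d) = ||Y - d||^2 / 2 so that nabla_Y Div = (Y - d) as a row vector.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
bs = [rng.normal(size=4), rng.normal(size=2)]
fs = [np.tanh, lambda z: z]
jac_fs = [lambda z: np.diag(1 - np.tanh(z) ** 2), lambda z: np.eye(len(z))]

x, d = rng.normal(size=3), rng.normal(size=2)
ys, zs = forward(x, Ws, bs, fs)
grad_Ws, grad_bs = backward(ys, zs, Ws, jac_fs, ys[-1] - d)
Ws = [W - 0.01 * gW.T for W, gW in zip(Ws, grad_Ws)]  # W_k <- W_k - eta (nabla_{W_k} Div)^T
```

For scalar activations one would normally skip building the diagonal Jacobian and just multiply element-wise by f_k'(z_k); the explicit Jacobian is kept here so that a vector activation such as softmax drops in unchanged.
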
  8. Note
    1. Does backprop do the right thing?
      1. In classification problems, the classification error is a non-differentiable function of the weights. The divergence function is a differentiable proxy; minimizing it does not necessarily minimize the classification error.
        1. Bias-variance tradeoff
          1. Perceptron – low bias, high variance.
          2. Backprop – high bias, low variance.
        2. Backprop may not find a separating solution even when such a solution is within the class of functions learnable by the network.
    2. Does backprop always find the global optimum? What about local optima? (Loss surface)
      1. Popular hypothesis
        1. In large networks, saddle points are exponentially more common than local minima.
        2. Most local minima are equivalent and close to the global minimum.
      2. Saddle point:
        1. The gradient is zero.
        2. The loss increases in some directions (positive Hessian eigenvalues) but decreases in others (negative Hessian eigenvalues).
        3. Gradient descent often gets stuck near saddle points.
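
A standard textbook illustration of a saddle point (not from the slides): take f(x, y) = x^2 - y^2. Then

\nabla f(0, 0) = (0, 0), \qquad H(0, 0) = \begin{pmatrix} 2 & 0 \\ 0 & -2 \end{pmatrix} \text{ with eigenvalues } +2 \text{ and } -2,

so the slope at the origin is zero, yet the function increases along the x-axis and decreases along the y-axis; gradient descent started exactly on the x-axis would crawl toward the origin and stall there.
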
