Stay humble. Stay hungry. Stay foolish.

  1. Recap
    1. The MLP can represent any function
    2. How to learn the function
      1. By minimizing expected error
      2. By sampling the function
      3. The empirical risk
      4. Empirical risk minimization
  2. Math Basics
    1. Gradients of scalar functions with multi-variate inputs
      1. df(X) = \nabla_Xf(X) dX (scalar, row vector, column vector)
    2. The inner product is maximum (minimum) when two vectors have the same (opposite) direction.
      1. Move X opposite to \nabla_Xf(X).
  3. Unconstrained Minimization of function
    1. Closed-form solution
      1. Solve \nabla_Xf(X) = 0 for X to find candidate solutions.
      2. Compute the Hessian Matrix \nabla^2f(X) at the candidate solution
        1. Hessian is positive definite (all eigenvalues positive) -> local minimum
        2. Hessian is negative definite (all eigenvalues negative) -> local maximum
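The closed-form recipe above can be sketched numerically. This is a toy example of my own choosing (f(x, y) = x^2 + 3y^2, whose gradient vanishes at the origin), not from the notes:

```python
import numpy as np

# Toy function f(x, y) = x^2 + 3y^2 (illustrative choice).
# Gradient [2x, 6y] is zero at the origin -> candidate solution.
# Hessian is constant for this quadratic: [[2, 0], [0, 6]].
hessian = np.array([[2.0, 0.0],
                    [0.0, 6.0]])
eigenvalues = np.linalg.eigvalsh(hessian)  # eigvalsh: for symmetric matrices

if np.all(eigenvalues > 0):
    verdict = "local minimum"        # positive definite
elif np.all(eigenvalues < 0):
    verdict = "local maximum"        # negative definite
else:
    verdict = "saddle or inconclusive"

print(verdict)  # -> local minimum
```

For non-quadratic functions the Hessian must be evaluated at each candidate point rather than once.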
    2. Iterative solution
      1. Overview: Begin with a guess and refine it iteratively
      2. 1D Example:
        1. Start from an initial guess x_0 for the optimal x.
        2. Update the guess towards a (hopefully) “better” value of f(x).
          x^{k+1} = x^k - \eta^k f'(x^k).
          \eta^k is the step size.

          1. Positive derivative -> move left to decrease f(x)
          2. Negative derivative -> move right to decrease f(x)
        3. Stop when f(x) no longer decreases.
          f'(x^k) = 0 is not always guaranteed. For example, the global minimum may lie outside the boundary of the feasible region.
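The 1D update and stopping rule above can be sketched as follows. The function f(x) = (x - 3)^2 and the step size are my own illustrative choices:

```python
# 1-D gradient descent: x^{k+1} = x^k - eta * f'(x^k).
def f(x):
    return (x - 3.0) ** 2       # minimum at x = 3

def f_prime(x):
    return 2.0 * (x - 3.0)

x = 0.0       # initial guess x_0
eta = 0.1     # step size eta^k (held constant here for simplicity)
for _ in range(200):
    step = eta * f_prime(x)
    x = x - step
    if abs(step) < 1e-9:        # stop when f(x) no longer decreases noticeably
        break

print(x)  # converges near the minimizer x = 3
```

With a fixed step size this converges because the update shrinks the distance to the minimizer by a constant factor each iteration; too large an eta would overshoot.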
      3. Gradient descent / ascent (multivariate)
        1. Find maximum x^{k+1} = x^k + \eta^k \nabla_xf(x^k)^T.
        2. Find minimum x^{k+1} = x^k - \eta^k \nabla_xf(x^k)^T.
  4. Problem Statement
    1. Goal: given training pairs (X_i, d_i), minimize Loss(W) = \frac{1}{T} \sum_i div(f(X_i; W), d_i) w.r.t. W.
    2. Typical Network
      1. MLP
      2. Layered. Acyclic.
      3. Input layers + hidden layers + output layers.
    3. Individual Neurons
      1. y = f(\sum_i w_ix_i + b).
      2. f(x) is the activation function.
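The single-neuron computation y = f(\sum_i w_ix_i + b) can be sketched as below. The sigmoid is one common choice for f (the notes leave f generic), and the weights, bias, and input are illustrative values I picked:

```python
import numpy as np

def sigmoid(z):
    # A common activation choice; the notes leave f generic.
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -1.0, 2.0])   # illustrative weights w_i
b = 0.1                          # illustrative bias
x = np.array([1.0, 2.0, 0.5])    # one input vector

z = np.dot(w, x) + b             # affine combination: sum_i w_i x_i + b
y = sigmoid(z)                   # scalar activation
print(y)
```

Here z = -0.4, so y = sigmoid(-0.4), a value strictly between 0 and 1.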
    4. Activation functions
      1. Should be differentiable almost everywhere. ReLU is not differentiable at x = 0; use a sub-gradient there, e.g., the derivative of the branch f(x) = x.
      2. Scalar activations: y = f(x_1, x_2, ..., x_k; W), a single output per unit.
      3. Vector Activations [y_1,y_2,...,y_l] = f(x_1,x_2,...,x_k;W). In a layered network, each layer of perceptrons can be viewed as a single vector activation.
        1. Example: Softmax
          1. z_i = \sum_j w_{ji}x_j + b_i
          2. y_i = \frac{exp(z_i)}{\sum_j exp(z_j)}
      4. Difference between scalar/vector activations: with a vector activation, changing one weight influences all the outputs; with a scalar activation it affects only one output.
    5. Notations
      1. Input layer is 0th layer.
      2. Output of i th perceptron of the k th layer as y_i^{(k)}.
      3. Represent the weight of the connection between the i th unit of the k-1 th layer and the j th unit of the k th layer as w_{ij}^{(k)}.
      4. The bias to the j th unit of the k th layer is b_j^{(k)}.
      5. Input / Output
        1. Input : X, D dimensional vectors
        2. Desired Output: d, L dimensional vectors
        3. Actual Output: y, L dimensional vectors
      6. Output
        1. Real values: direct output
        2. Binary classifier: a single probabilistic output
          1. Output activation (Usually sigmoid)
            1. Viewed as probability P(Y = 1|X).
        3. Multi-class classifier: One-hot representations
          1. N class. N binary outputs.
          2. Use vector activations.  Posterior probability vector.
      7. Divergence
        1. Need to be differentiable
        2. Popular
          1. Real values: the (scaled) L2 divergence Div(Y, d) = \frac{1}{2} \|Y - d\|^2, which reduces to \frac{1}{2}(Y - d)^2 for a scalar output.
          2. Binary classifier: the cross-entropy Div(Y,d) = -dlogY - (1-d)log(1-Y).
            1. Derivative: -\frac{1}{Y} (d = 1) else \frac{1}{1-Y} (d = 0).
            2. Note: When Y = 1 and d = 1, the derivative is still not 0 (it equals -1), because the output is restricted to the open interval (0, 1).
          3. Multi-class classifier:
            1. Basic: Div(Y, d) = -\sum_i d_i \log y_i = -\log y_c. (only the d_i for the true class c is 1)
              1. Derivative: -\frac{1}{y_c} for the c-th component; 0 for the remaining components.
              2. Note: for y_c < 1 the slope is negative, indicating that increasing y_c will reduce the divergence.
            2. Label smoothing: set the target output to [\epsilon, \epsilon, ..., 1-(K-1)\epsilon, ..., \epsilon], with 1-(K-1)\epsilon at the target class c.
              1. Derivative: -\frac{1-(K-1)\epsilon}{y_c} for the c-th component; -\frac{\epsilon}{y_i}  for the remaining component.
              2. Note: we get a reasonable derivative even for non-target classes.
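The one-hot and label-smoothed cross-entropy derivatives above can be checked numerically. The output vector, target class, and epsilon are illustrative values I picked:

```python
import numpy as np

def div(y_vec, d_vec):
    # Cross-entropy divergence: -sum_i d_i * log(y_i)
    return -np.sum(d_vec * np.log(y_vec))

y = np.array([0.7, 0.2, 0.1])   # network output (posterior vector)
c = 0                           # target class

# One-hot target: derivative is -1/y_c at c, 0 elsewhere.
d_onehot = np.array([1.0, 0.0, 0.0])
grad = -d_onehot / y            # [-1/0.7, 0, 0]

# Label smoothing with K = 3, epsilon = 0.05:
eps, K = 0.05, 3
d_smooth = np.full(K, eps)
d_smooth[c] = 1.0 - (K - 1) * eps
grad_smooth = -d_smooth / y     # nonzero for every class now

print(div(y, d_onehot), grad, grad_smooth)
```

The smoothed gradient is nonzero for the non-target classes, which is the "reasonable derivative" noted above.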
          4. Total derivative: \frac{dLoss}{dw_{i,j}^{(k)}} = \frac{1}{T} \sum_t \frac{dDiv(Y_t, d_t)}{dw_{i,j}^{(k)}}.
          5. Math Recap:
            1. y = f(g_1(x), g_2(x), ..., g_M(x)).
            2. Derivative: dy = (\frac{\partial f}{\partial g_1} \frac{dg_1(x)}{dx} + ... + \frac{\partial f}{\partial g_M} \frac{dg_M(x)}{dx}) dx.
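The multivariate chain rule in the recap can be verified numerically against a finite difference. The functions g1(x) = x^2, g2(x) = sin(x), and f(g1, g2) = g1 * g2 are toy choices of mine:

```python
import math

# y = f(g1(x), g2(x)) with g1 = x^2, g2 = sin(x), f = g1 * g2.
# Chain rule: dy/dx = (df/dg1)(dg1/dx) + (df/dg2)(dg2/dx).
def dy_dx(x):
    g1, g2 = x * x, math.sin(x)
    df_dg1, df_dg2 = g2, g1            # partials of f = g1 * g2
    dg1_dx, dg2_dx = 2 * x, math.cos(x)
    return df_dg1 * dg1_dx + df_dg2 * dg2_dx

def y(x):
    return (x * x) * math.sin(x)

# Central finite-difference approximation at an arbitrary point:
x0, h = 1.3, 1e-6
numeric = (y(x0 + h) - y(x0 - h)) / (2 * h)
print(abs(dy_dx(x0) - numeric))  # agreement to roughly 1e-9
```

This summing over all paths from x to y is the core identity behind backpropagation through a layered network.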
