Stay humble. Stay hungry. Stay foolish.

Recap

  1. Modification to the original perceptron
    1. Bias can be viewed as the weight of another constant input of 1
    2. Activation functions are not necessarily threshold functions
  2. Design the network
    1. The structure of the network is feed-forward: no loops. Neuron outputs do not feed back to their inputs, directly or indirectly.
    2. The parameters of the network: the weights and biases.

Learning the Network: Determine these parameters.

Construct by hand: Shape decision boundaries

Automatic estimation of an MLP

  1. Ideal Target:
    1. Definition:
      When f(X; W) has the capacity to exactly represent g(X).
      \widehat{W} = \text{argmin}_{W}\int_X div(f(X;W), g(X)) dX.
      div() is a divergence function (loss function) that goes to zero when f(X;W) = g(X).
    2. Intuition:
      1. In other words, minimize the divergence between f(X;W) and the target function g(X).
  2. Practical Problem
    1. Problem: g(X) is unknown. Cannot perform the integral on the input domain.
    2. Solution: Sample the function.
      1. Assumption: The sampling of X matches the natural distribution of X.
        1. Note: The professor implies the samples are i.i.d. However, X may not be sampled uniformly, so the fit is weighted toward the regions where X appears more frequently.
      2. Intuition:
        1. Fit the sample well and then hope it fits the function well.
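The sampling idea above can be sketched numerically: replace the integral over the input domain with an average over samples of X. A minimal Monte Carlo sketch in Python, where the target g (sin), the stand-in model f (a truncated Taylor series), and the standard-normal distribution of X are all hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for illustration only:
g = np.sin                       # the "unknown" target function
f = lambda x: x - x**3 / 6       # a model approximating g near 0

def div(a, b):
    return (a - b) ** 2          # squared-error divergence

# Draw samples of X from its (assumed) natural distribution.
X = rng.normal(size=100_000)

# Empirical estimate of the expected divergence: regions where X is
# sampled more often automatically contribute more to the estimate.
empirical_risk = div(f(X), g(X)).mean()
print(empirical_risk)
```

Fitting the samples well then stands in for fitting the (unobservable) integral over the whole input domain.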
  3. Methodology:
    1. A single perceptron (linearly separable):
      1. Target: \sum w_iX_i = 0: a hyperplane formed by all vectors X orthogonal to w. (Here X is not the input but a vector lying on the hyperplane).
        1. Method: PLA.
          1. O(X_i) = sign(W^TX_i). If O(X_i) \ne Y_i, then W = W + Y_iX_i.
          2. Theoretical guarantee: if the data are linearly separable, PLA finds a separating hyperplane after at most (\frac{R}{\gamma})^2 updates.
            1. R: the length of the longest vector
            2. \gamma: the best-case margin
      2. Intuition:
        1. If there is a single point labeled 1 / -1, the ideal weight vector points toward / away from that point.
        2. With more points, keep combining the corresponding vectors.
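The PLA update rule above can be sketched in a few lines of Python. The toy dataset (AND-like labels) is a hypothetical example; the bias is folded in as a constant input of 1, as in the recap:

```python
import numpy as np

def pla(X, y, max_steps=1000):
    """Perceptron Learning Algorithm on linearly separable data.

    X: (N, d) inputs, with a constant 1 appended as the bias input.
    y: (N,) labels in {-1, +1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(max_steps):
        preds = np.sign(X @ w)
        mistakes = np.nonzero(preds != y)[0]
        if mistakes.size == 0:
            return w                 # every point classified correctly
        i = mistakes[0]
        w = w + y[i] * X[i]          # the W = W + Y_i X_i update from the notes
    return w

# Hypothetical toy data (AND function), last column is the bias input of 1.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], float)
y = np.array([-1, -1, -1, 1])
w = pla(X, y)
```

Because this toy set is linearly separable, the (R/γ)² bound guarantees the loop terminates with sign(X @ w) equal to y.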
    2. Multiple perceptrons (not linearly separable):
      1. PLA: it cannot be used to learn an MLP; assigning intermediate labels has exponential complexity (PLA requires an input-output relation for every perceptron).
      2. Greedy algorithms: ADALINE and MADALINE
        1. For each perceptron, flip the decision on one instance at a time, keeping the flip only if it reduces the overall error
      3. Continuous value activation function:
        1. Problem
          1. With threshold activations, an individual neuron's weights can change significantly without changing the overall error -> the error is non-differentiable in the weights.
          2. Real-life data are not linearly separable.
        2. Solution:
          1. Continuous-valued activation function.
            1. Popular choice: Sigmoid
              1. Intuition: as X increases, the average value of Y within a small interval (i.e., the probability of Y=1) traces out a sigmoid curve.
              2. The logistic regression model:
                P(Y = 1|X) = \sigma(W^TX), where \sigma(x) = \frac{1}{1 + e^{-x}}.
          2. The activation function is differentiable:
            z = \sum_i w_ix_i, \quad y = \sigma(z), \quad \frac{dy}{dz} = \sigma'(z), \quad \frac{dy}{dw_i} = \sigma'(z)x_i, \quad \frac{dy}{dx_i} = \sigma'(z)w_i.
          3. The entire network is differentiable as well: y_j^{k} = \sigma(\sum_i w_{i,j}^{k-1} y_i^{k-1}).
          4. Minimize the expected error:
            \widehat{W} = argmin_{W} E[div(f(X;W), g(X))]

            1. Emphasize the more frequent values of X.
          5. Empirical Risk Minimization:
            1. Ideal: E[div(f(X;W), g(X))] = \int_X div(f(X;W), g(X)) P(X) dX (a sum over X in the discrete case), weighting by the probability of X.
            2. Empirical estimate: E[div(f(X;W), g(X))] \approx \frac{1}{N}\sum_i div(f(X_i;W), d_i)
            3. Loss(W) = \frac{1}{N} \sum_i div(f(X_i;W), d_i)
            4. Problem Statement:
              1. Given a training set of input-output pairs (X_1, d_1), ..., (X_N, d_N).
              2. Minimize the loss function
                Loss(W) = \frac{1}{N} \sum_i div(f(X_i;W), d_i)
                w.r.t. W
              3. This is a problem of function minimization
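Putting the pieces together, here is a minimal sketch of empirical risk minimization for a single logistic-activation perceptron. The squared-error divergence, the toy training set (the AND function), and the step size are all hypothetical choices; the gradient follows the chain rule above, using σ'(z) = σ(z)(1 − σ(z)):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def loss_and_grad(W, X, d):
    """Loss(W) = (1/N) sum div(f(X_i;W), d_i) and its gradient,
    for f(X;W) = sigma(W^T X) and squared-error divergence."""
    z = X @ W
    y = sigmoid(z)
    loss = np.mean((y - d) ** 2)
    # Chain rule: dLoss/dW = (2/N) * sum_i (y_i - d_i) * sigma'(z_i) * X_i,
    # with sigma'(z) = y * (1 - y).
    grad = (2.0 / len(d)) * (X.T @ ((y - d) * y * (1 - y)))
    return loss, grad

# Hypothetical training pairs (X_i, d_i); last column is the bias input of 1.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], float)
d = np.array([0.0, 0.0, 0.0, 1.0])

W = np.zeros(3)
for _ in range(5000):
    loss, grad = loss_and_grad(W, X, d)
    W -= 1.0 * grad          # plain gradient descent, step size 1.0
print(loss)
```

At W = 0 the loss is 0.25 (every output is 0.5); gradient descent drives it down because every quantity in the chain is differentiable.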
      4. Summary:
        1. Threshold activations require solving a hard combinatorial-optimization problem
        2. Continuous activation functions with non-zero derivatives enable us to estimate network parameters
          1. A logistic-activation perceptron computes the a posteriori probability of the output given the input
        3. Use a differentiable divergence between the output of the network and the desired output (both the activation functions and the divergence must be differentiable)
        4. Optimize network parameters to minimize this error. (Empirical Risk Minimization)
      5. A Note on Derivatives
        1. Multivariate scalar function derivatives
        2. Minima via derivatives:
          1. Single variable: the 1st-order derivative is 0 and the 2nd-order derivative is positive.
          2. Multiple variables: the gradient (1st-order derivative) is 0 and the Hessian (2nd-order derivative) is positive definite, i.e., positive curvature in every direction.
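The two minimum conditions can be checked numerically. A small sketch on a hypothetical quadratic function, using central differences for the gradient and the eigenvalues of the (known) Hessian for positive definiteness:

```python
import numpy as np

# Hypothetical example: f(w1, w2) = (w1 - 1)^2 + 2*(w2 + 3)^2,
# which has its minimum at (1, -3).
def f(w):
    return (w[0] - 1) ** 2 + 2 * (w[1] + 3) ** 2

w_star = np.array([1.0, -3.0])

# 1st-order condition: gradient is (numerically) zero at the minimum.
eps = 1e-5
grad = np.array([
    (f(w_star + eps * e) - f(w_star - eps * e)) / (2 * eps)
    for e in np.eye(2)
])

# 2nd-order condition: the Hessian is positive definite.
H = np.array([[2.0, 0.0], [0.0, 4.0]])   # exact Hessian of this f
eigvals = np.linalg.eigvalsh(H)
print(grad, eigvals)
```

All eigenvalues of the Hessian being positive is the multivariate analogue of a positive second derivative: curvature is positive along every direction, so the stationary point is a minimum rather than a saddle.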
