- Recap
- Neural networks are universal approximators (provided the right architecture)
- To approximate a given function we must train them, i.e., determine the architecture, weights, and biases
- Neural networks are trained to minimize a total loss on a training set (empirical risk minimization)
- We use variants of gradient descent to do so.
- The gradient of the error with respect to the network parameters is computed through backpropagation.
- Training Neural Nets by Gradient Descent
- Total training error:
  $Err = \frac{1}{T}\sum_{t} Div(Y_t, d_t)$
- Initialize all weights.
- For every layer $k$ compute:
  $w_{i,j}^{(k)} \leftarrow w_{i,j}^{(k)} - \eta \frac{\partial Err}{\partial w_{i,j}^{(k)}}$
- Until Loss has converged.
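As a minimal sketch of this loop (illustrative, not from the slides), here is batch gradient descent on a one-dimensional linear model with a squared-error divergence; the data, learning rate, and convergence tolerance are arbitrary choices.

```python
import numpy as np

# Batch gradient descent on a 1-D linear model y = w*x + b with a
# squared-error divergence, looping until the loss stops changing.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
d = 3.0 * X - 0.5 + 0.1 * rng.standard_normal(100)    # noisy targets

w, b = rng.standard_normal(), rng.standard_normal()   # initialize all weights
eta, prev_loss = 0.1, np.inf

while True:
    Y = w * X + b                                     # forward pass
    loss = np.mean((Y - d) ** 2)                      # total training error
    dw = np.mean(2 * (Y - d) * X)                     # dLoss/dw
    db = np.mean(2 * (Y - d))                         # dLoss/db
    w, b = w - eta * dw, b - eta * db                 # update every parameter
    if abs(prev_loss - loss) < 1e-8:                  # until the loss has converged
        break
    prev_loss = loss

print(f"learned w={w:.3f}, b={b:.3f}, final loss={loss:.5f}")
```

The same outer structure carries over to a full network once the gradients are computed by backpropagation, as worked out in the following sections.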
- Forward Computation
- Iterate for k = 1, ..., N:
- For j = 1:layer-width:
  $z_j^{(k)} = \sum_i w_{i,j}^{(k)}\, y_i^{(k-1)} + b_j^{(k)}$
  $y_j^{(k)} = f_k\left(z_j^{(k)}\right)$
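A NumPy sketch of this forward pass, assuming each $W^{(k)}$ is stored as a $D_k \times D_{k-1}$ matrix (as in the vector formulation later) so the per-neuron sums become `W @ y + b`; the function name and the 3-4-2 tanh network in the usage lines are illustrative.

```python
import numpy as np

def forward(x, weights, biases, activations):
    """Forward pass: for k = 1..N compute z_k = W_k y_{k-1} + b_k and
    y_k = f_k(z_k), keeping every intermediate value for the backward pass."""
    ys, zs = [x], []
    for W, b, f in zip(weights, biases, activations):
        z = W @ ys[-1] + b        # weighted sum of the previous layer's outputs
        zs.append(z)
        ys.append(f(z))           # element-wise (scalar) activation
    return ys, zs

# Illustrative usage: a 3-4-2 network with tanh activations.
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
ys, zs = forward(rng.standard_normal(sizes[0]), Ws, bs, [np.tanh, np.tanh])
```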
- Gradients: Backward Computation
- Initialize gradient w.r.t. network output:
  $\frac{\partial Div}{\partial y_i^{(N)}} = \frac{\partial Div(Y, d)}{\partial y_i^{(N)}}$
- For layer $k = N$ downto $1$; for $j = 1$:layer-width:
  $\frac{\partial Div}{\partial z_j^{(k)}} = f_k'\left(z_j^{(k)}\right) \frac{\partial Div}{\partial y_j^{(k)}}$
  $\frac{\partial Div}{\partial y_i^{(k-1)}} = \sum_j w_{i,j}^{(k)} \frac{\partial Div}{\partial z_j^{(k)}}$
  $\frac{\partial Div}{\partial w_{i,j}^{(k)}} = y_i^{(k-1)} \frac{\partial Div}{\partial z_j^{(k)}}, \quad \frac{\partial Div}{\partial b_j^{(k)}} = \frac{\partial Div}{\partial z_j^{(k)}}$
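A matching backward-pass sketch under the same storage convention; the arguments `d_div_d_output` (the gradient of the divergence w.r.t. the network output) and `act_derivs` (the element-wise activation derivatives) are assumed inputs, and the usage lines repeat a tiny forward pass so the block is self-contained.

```python
import numpy as np

def backward(ys, zs, weights, d_div_d_output, act_derivs):
    """Backward pass matching the per-neuron equations above.  ys, zs come from
    the forward pass; d_div_d_output is dDiv/dy^(N); act_derivs[k](z) evaluates
    f_k'(z) element-wise.  Returns dDiv/dW^(k) and dDiv/db^(k) for every layer."""
    dWs, dbs = [], []
    dy = d_div_d_output                        # gradient w.r.t. network output
    for k in reversed(range(len(weights))):
        dz = act_derivs[k](zs[k]) * dy         # dDiv/dz_j = f_k'(z_j) dDiv/dy_j
        dWs.insert(0, np.outer(dz, ys[k]))     # entry (j, i): dDiv/dz_j * y_i^(k-1), same shape as W_k
        dbs.insert(0, dz)                      # dDiv/db_j = dDiv/dz_j
        dy = weights[k].T @ dz                 # dDiv/dy^(k-1) = sum_j w_ij dDiv/dz_j
    return dWs, dbs

# Illustrative usage: a 3-4-2 tanh network with a squared-error divergence.
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
bs = [np.zeros(4), np.zeros(2)]
x, d = rng.standard_normal(3), np.array([1.0, -1.0])
ys, zs = [x], []
for W, b in zip(Ws, bs):                       # forward pass (as in the previous sketch)
    zs.append(W @ ys[-1] + b)
    ys.append(np.tanh(zs[-1]))
dWs, dbs = backward(ys, zs, Ws, 2 * (ys[-1] - d), [lambda z: 1 - np.tanh(z) ** 2] * 2)
```

Gradients w.r.t. each $W^{(k)}$ are returned in the same shape as $W^{(k)}$, so they can be plugged directly into the update step above.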
- Special cases
- Assumptions:
- The computation of a neuron does not affect the computation of other neurons in the same layer or in previous layers.
- Neuron outputs are combined through weighted addition.
- Activations are differentiable.
- All these assumptions can be violated (examples in the slides).
- Examples:
- Vector activations (violate assumption (1)): every output $y_i$ of the layer depends on all the $z_j$ of that layer.
- Example: Softmax
  $y_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$
  $\frac{\partial y_i}{\partial z_j} = y_i(1 - y_i)$ if $i = j$
  $\frac{\partial y_i}{\partial z_j} = -y_i y_j$ if $i \neq j$
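A small numerical check of the softmax Jacobian stated above; the finite-difference comparison and tolerance are just a sanity check.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())                  # shift for numerical stability
    return e / e.sum()

def softmax_jacobian(z):
    """Full (non-diagonal) Jacobian: dy_i/dz_j = y_i(1 - y_i) if i == j,
    else -y_i * y_j.  Equivalently diag(y) - y y^T."""
    y = softmax(z)
    return np.diag(y) - np.outer(y, y)

# Compare against a central finite-difference approximation.
z = np.array([1.0, -0.5, 2.0])
J = softmax_jacobian(z)
eps = 1e-6
J_fd = np.column_stack([(softmax(z + eps * np.eye(3)[j]) - softmax(z - eps * np.eye(3)[j])) / (2 * eps)
                        for j in range(3)])
assert np.allclose(J, J_fd, atol=1e-6)
```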
- Overall Approach
- For each data instance, run a forward pass and a backward pass.
- The actual loss is the sum (average) of the divergence over all training instances:
  $Loss = \frac{1}{T}\sum_{t} Div(Y_t, d_t)$
- The actual gradient is the sum / average of the gradients computed for each training instance:
  $\frac{\partial Loss}{\partial w_{i,j}^{(k)}} = \frac{1}{T}\sum_{t} \frac{\partial Div(Y_t, d_t)}{\partial w_{i,j}^{(k)}}$
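A toy illustration (with a bare linear layer standing in for the network) that averaging per-instance gradients gives the gradient of the averaged loss; data and shapes are arbitrary.

```python
import numpy as np

# Average the per-instance gradients (here for a linear model y = W x with
# squared-error divergence) and compare with the gradient of the averaged loss.
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))
X = rng.standard_normal((5, 3))              # 5 training instances
D = rng.standard_normal((5, 2))              # targets

grad_sum = np.zeros_like(W)
for x, d in zip(X, D):                       # one forward + backward per instance
    y = W @ x
    dz = 2 * (y - d)                         # dDiv/dy for Div = ||y - d||^2
    grad_sum += np.outer(dz, x)              # per-instance gradient dDiv/dW
grad_avg = grad_sum / len(X)                 # average over the training set

# Same quantity computed in one shot on the whole batch.
grad_batch = 2 * (X @ W.T - D).T @ X / len(X)
assert np.allclose(grad_avg, grad_batch)
```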
- Vector formulation
- Problem Statement
- Assume the width of layer $k-1$ is $D_{k-1}$ and the width of layer $k$ is $D_k$.
  $y^{(k-1)}$ and $z^{(k-1)}$ are $D_{k-1}$-dimensional column vectors.
  $y^{(k)}$ and $z^{(k)}$ are $D_k$-dimensional column vectors.
  $W^{(k)}$ is a $D_k \times D_{k-1}$ matrix; $b^{(k)}$ is a $D_k$-dimensional column vector.
  $y^{(0)} = x$ is the network input and $Y = y^{(N)}$ is the network output.
- Forward:
  $z^{(k)} = W^{(k)} y^{(k-1)} + b^{(k)}, \quad y^{(k)} = f_k\left(z^{(k)}\right)$
- Backward:
  $\nabla_{z^{(k)}} Div = \nabla_{y^{(k)}} Div \; J_{y^{(k)}}\left(z^{(k)}\right), \quad \nabla_{y^{(k-1)}} Div = \nabla_{z^{(k)}} Div \; W^{(k)}, \quad \nabla_{W^{(k)}} Div = y^{(k-1)} \nabla_{z^{(k)}} Div$
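One layer of this vector formulation in NumPy, keeping gradients as row vectors so the backward equations read exactly as above; tanh is used as the scalar activation, so its Jacobian is diagonal. Note that with this convention $\nabla_{W^{(k)}} Div = y^{(k-1)} \nabla_{z^{(k)}} Div$ has the transposed shape of $W^{(k)}$, so an update would transpose it.

```python
import numpy as np

# One layer of the vector formulation, with gradients kept as row vectors.
rng = np.random.default_rng(0)
D_prev, D_k = 4, 3
W = rng.standard_normal((D_k, D_prev))       # D_k x D_{k-1}
b = rng.standard_normal(D_k)
y_prev = rng.standard_normal(D_prev)

# Forward: z = W y + b, y = f(z)  (tanh as the scalar activation)
z = W @ y_prev + b
y = np.tanh(z)

# Backward, given the row-vector gradient of Div w.r.t. this layer's output.
grad_y = rng.standard_normal((1, D_k))       # nabla_{y_k} Div (1 x D_k)
J = np.diag(1 - np.tanh(z) ** 2)             # Jacobian of tanh: diagonal matrix
grad_z = grad_y @ J                          # nabla_{z_k} Div = nabla_{y_k} Div J_{y_k}(z_k)
grad_y_prev = grad_z @ W                     # nabla_{y_{k-1}} Div = nabla_{z_k} Div W_k
grad_W = np.outer(y_prev, grad_z)            # nabla_{W_k} Div, shape D_{k-1} x D_k (transpose to update W)
grad_b = grad_z                              # nabla_{b_k} Div = nabla_{z_k} Div
```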
- Jacobian Matrix Math Recap:
- Definition: The derivative of a vector function w.r.t. a vector input is called a Jacobian.
- Jacobian matrix: first-order partial derivatives of a vector-valued function.
- Hessian matrix: second-order partial derivatives of a scalar-valued function.
- Formula: for $y = f(x)$ with $y \in \mathbb{R}^M$ and $x \in \mathbb{R}^N$, the Jacobian is the $M \times N$ matrix with entries
  $J_{i,j} = \frac{\partial y_i}{\partial x_j}$
- Properties:
- Scalar activation (applied element-wise): the Jacobian is a diagonal matrix.
- Vector activation: the Jacobian is a full matrix.
- Special cases:
- Affine function: $z = Wy + b$; the Jacobian of $z$ w.r.t. $y$ is simply the matrix $W$:
  $J_z(y) = W$
- Vector derivatives (chain rule):
- For vector functions of vector inputs: $z = f(y)$, $y = g(x)$
  $\Rightarrow J_z(x) = J_z(y)\, J_y(x)$
- For scalar functions of vector inputs: $D = f(y)$; the derivative $\nabla_y D = \left[\frac{\partial D}{\partial y_1}, \ldots, \frac{\partial D}{\partial y_N}\right]$ is a row vector, and for $y = g(x)$:
  $\nabla_x D = \nabla_y D\, J_y(x)$
- Scalar functions of affine functions: $D = f(z)$, $z = Wy + b$
  $\Rightarrow \nabla_y D = \nabla_z D\, W, \quad \nabla_W D = y\, \nabla_z D, \quad \nabla_b D = \nabla_z D$
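A numerical check of the Jacobian chain rule for an affine map followed by an element-wise tanh: the composed Jacobian should equal the diagonal activation Jacobian times $W$. All values here are arbitrary.

```python
import numpy as np

# Check J_z(x) = J_z(y) J_y(x) for z = tanh(y), y = W x + b:
# the result should be diag(1 - tanh(y)^2) @ W.
rng = np.random.default_rng(1)
W = rng.standard_normal((3, 4))
b = rng.standard_normal(3)
x = rng.standard_normal(4)

def h(x):
    return np.tanh(W @ x + b)

y = W @ x + b
J_analytic = np.diag(1 - np.tanh(y) ** 2) @ W     # diagonal Jacobian times affine Jacobian

eps = 1e-6
J_numeric = np.column_stack([(h(x + eps * np.eye(4)[j]) - h(x - eps * np.eye(4)[j])) / (2 * eps)
                             for j in range(4)])
assert np.allclose(J_analytic, J_numeric, atol=1e-6)
```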
- Backward propagation pass:
- Apply the chain rule (vector activation at the last layer).
- Set $y^{(N)} = Y$, the network output.
- Initialize: Compute $\nabla_Y Div = \frac{d\, Div(Y, d)}{dY}$.
- For layer k = N downto k = 1:
- Compute the Jacobian matrix $J_{y^{(k)}}\left(z^{(k)}\right)$.
- Requires intermediate values computed in the forward pass.
- Recursion:
  $\nabla_{z^{(k)}} Div = \nabla_{y^{(k)}} Div \; J_{y^{(k)}}\left(z^{(k)}\right)$
  $\nabla_{y^{(k-1)}} Div = \nabla_{z^{(k)}} Div \; W^{(k)}$
- Gradient computation:
  $\nabla_{W^{(k)}} Div = y^{(k-1)} \nabla_{z^{(k)}} Div$
  $\nabla_{b^{(k)}} Div = \nabla_{z^{(k)}} Div$
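A compact sketch of the whole vector-form pass, with tanh hidden layers (diagonal Jacobians), a softmax output layer (full Jacobian), and a cross-entropy divergence; the layer sizes, data, and names are illustrative. The final assertion checks the standard identity that softmax plus cross-entropy gives $\nabla_{z^{(N)}} Div = Y - d$.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Full backward pass in vector form: tanh hidden layers, softmax output,
# cross-entropy divergence Div = -sum_i d_i log Y_i.
rng = np.random.default_rng(0)
sizes = [5, 4, 3]
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.standard_normal(m) for m in sizes[1:]]
x = rng.standard_normal(sizes[0])
d = np.array([0.0, 1.0, 0.0])                      # one-hot target

# Forward pass, storing the intermediate values needed by the backward pass.
ys, zs = [x], []
for k, (W, b) in enumerate(zip(Ws, bs)):
    z = W @ ys[-1] + b
    zs.append(z)
    ys.append(softmax(z) if k == len(Ws) - 1 else np.tanh(z))
Y = ys[-1]

# Initialize: gradient of Div w.r.t. the network output (a row vector).
grad_y = (-d / Y)[None, :]

grads_W, grads_b = [], []
for k in reversed(range(len(Ws))):
    if k == len(Ws) - 1:                           # vector activation: full Jacobian
        J = np.diag(Y) - np.outer(Y, Y)
    else:                                          # scalar activation: diagonal Jacobian
        J = np.diag(1 - np.tanh(zs[k]) ** 2)
    grad_z = grad_y @ J                            # nabla_z Div = nabla_y Div J_y(z)
    grads_W.insert(0, np.outer(grad_z, ys[k]))     # stored in the same shape as W_k
    grads_b.insert(0, grad_z.ravel())
    grad_y = grad_z @ Ws[k]                        # recursion: nabla_{y_{k-1}} Div

# Sanity check: for softmax + cross-entropy, nabla_z Div at the top is Y - d.
assert np.allclose(grads_b[-1], Y - d)
```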
- Note
- Does backprop do the right thing?
- In classification problems, the classification error is a non-differentiable function of the weights. The divergence function used for training is a differentiable proxy; minimizing it does not directly minimize the classification error (a small illustration follows this block).
- Bias-variance tradeoff:
- Perceptron: low bias, high variance.
- Backprop: high bias, low variance.
- Backprop may not find a separating solution even if one exists within the class of functions learnable by the network.
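A tiny illustration of the proxy point above (with made-up data): for a 1-D linear classifier, the 0-1 classification error is a step function of the weight, while the cross-entropy divergence varies smoothly with it.

```python
import numpy as np

# 0-1 classification error of sign(w*x) is a step function of w (zero gradient
# almost everywhere), while the cross-entropy proxy is a smooth function of w.
X = np.array([-2.0, -1.0, 0.5, 1.5, 3.0])
labels = np.array([0, 0, 1, 1, 1])            # 0/1 class labels

def class_error(w):
    pred = (w * X > 0).astype(int)
    return np.mean(pred != labels)

def cross_entropy(w):
    p = 1.0 / (1.0 + np.exp(-w * X))          # sigmoid "probability of class 1"
    return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

for w in [-1.0, -0.1, 0.1, 1.0, 2.0]:
    print(f"w={w:+.1f}  0-1 error={class_error(w):.2f}  cross-entropy={cross_entropy(w):.3f}")
```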
- Does backprop always find global optima? How about local optima? (Loss surface)
- Popular hypothesis:
- In large networks, saddle points are far more common (exponentially so) than local minima.
- Most local minima are equivalent and close to the global minimum.
- Saddle point:
- The slope (gradient) is zero.
- The loss increases in some directions (positive eigenvalues of the Hessian) but decreases in others (negative eigenvalues).
- Gradient descent often gets stuck at saddle points.
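For concreteness, a toy saddle point (not from the slides): $f(x, y) = x^2 - y^2$ has zero gradient at the origin and Hessian eigenvalues $+2$ and $-2$, and plain gradient descent started on the $x$-axis converges to the saddle and stays there.

```python
import numpy as np

# Toy saddle point: f(x, y) = x^2 - y^2 has zero gradient at the origin and
# Hessian eigenvalues +2 and -2 (increase along x, decrease along y).
def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])

hessian = np.array([[2.0, 0.0], [0.0, -2.0]])
print("Hessian eigenvalues:", np.linalg.eigvalsh(hessian))    # [-2.  2.]

# Gradient descent started exactly on the x-axis converges to the saddle
# point (0, 0) and stops there, even though f is unbounded below along y.
p = np.array([1.0, 0.0])
for _ in range(200):
    p = p - 0.1 * grad(p)
print("final point:", p)                                      # approx [0, 0]
```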