- Recap
- The MLP can represent any function
- How to learn the function
- By minimizing expected error
- By sampling the function
- The empirical risk
- Empirical risk minimization
- Math Basics
- Gradients of scalar functions with multi-variate inputs
($f$ is a scalar, $X$ is a column vector, and the gradient $\nabla_X f$ is a row vector)
- The inner product is maximum (minimum) when two vectors have the same (opposite) direction.
- Move in the direction of $\nabla_X f$ to increase $f$; move opposite to $\nabla_X f$ to decrease $f$.
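A short justification of the direction claim above (a sketch via the Cauchy-Schwarz inequality; the unit direction vector $u$ is notation introduced here for illustration):

```latex
% Directional derivative of f at X along a unit vector u:
\[
  D_u f(X) = \nabla_X f \cdot u ,
  \qquad
  |\nabla_X f \cdot u| \;\le\; \lVert \nabla_X f \rVert \, \lVert u \rVert
            \;=\; \lVert \nabla_X f \rVert ,
\]
% with equality exactly when u is parallel to \nabla_X f:
% f rises fastest along +\nabla_X f and falls fastest along -\nabla_X f.
```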
- Unconstrained Minimization of a function
- Closed-form solution
- Solve $\nabla_X f(X) = 0$ for $X$ to get the candidate solution $X^*$.
- Compute the Hessian matrix $\nabla_X^2 f(X)$ at the candidate solution (numeric sketch below).
- Hessian is positive definite (all eigenvalues positive) -> local minimum
- Hessian is negative definite (all eigenvalues negative) -> local maximum
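A minimal numeric sketch of the closed-form recipe, assuming a quadratic $f(X) = \frac{1}{2}X^\top A X - b^\top X$; the matrix `A` and vector `b` are made-up illustration values:

```python
import numpy as np

# Quadratic example: f(X) = 0.5 * X^T A X - b^T X
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])      # symmetric; also the Hessian of f
b = np.array([1.0, -1.0])

# Step 1: solve grad f(X) = A X - b = 0 for the candidate solution
x_star = np.linalg.solve(A, b)

# Step 2: inspect the Hessian at the candidate (constant here, equal to A)
eigvals = np.linalg.eigvalsh(A)

print("candidate:", x_star)
print("Hessian eigenvalues:", eigvals)
print("local minimum" if np.all(eigvals > 0) else
      "local maximum" if np.all(eigvals < 0) else "saddle/inconclusive")
```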
- Iterative solution
- Overview: Begin with a guess and refine it iteratively
- 1D Example:
- Start from an initial guess $x^0$ for the optimal $x$.
- Update the guess towards a (hopefully) “better” value of $f(x)$: $x^{k+1} = x^k - \eta \, f'(x^k)$, where $\eta$ is the step size.
- Positive derivative -> move left to decrease error
- Negative derivative -> move right to decrease error
- Stop when $f(x)$ no longer decreases. Reaching the global minimum is not always guaranteed; for example, the global minimum may lie outside the range being searched.
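A minimal sketch of the 1D update $x^{k+1} = x^k - \eta\,f'(x^k)$, using a made-up function $f(x) = (x-2)^2$ and an illustrative step size:

```python
# 1D gradient descent on f(x) = (x - 2)^2, whose minimum is at x = 2.
def f(x):
    return (x - 2.0) ** 2

def df(x):                         # derivative f'(x)
    return 2.0 * (x - 2.0)

x = -5.0                           # initial guess x^0
eta = 0.1                          # step size
prev = f(x)
for k in range(1000):
    x = x - eta * df(x)            # x^{k+1} = x^k - eta * f'(x^k)
    if abs(prev - f(x)) < 1e-12:   # stop when f(x) no longer decreases
        break
    prev = f(x)

print(f"x ≈ {x:.4f}, f(x) ≈ {f(x):.6f}")
```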
- Gradient descent / ascent (multivariate)
- Find maximum (gradient ascent): $x^{k+1} = x^k + \eta \, \nabla_x f(x^k)^T$.
- Find minimum (gradient descent): $x^{k+1} = x^k - \eta \, \nabla_x f(x^k)^T$.
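The same idea in the multivariate case, sketched on the quadratic bowl from the closed-form example above (same illustrative `A` and `b`); with the row-vector gradient convention the update adds the transpose of the gradient:

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

def grad(x):                        # gradient of 0.5 x^T A x - b^T x
    return A @ x - b

x = np.zeros(2)                     # initial guess x^0
eta = 0.1
for k in range(500):
    step = eta * grad(x)
    x = x - step                    # descent: x^{k+1} = x^k - eta * grad f(x^k)
    if np.linalg.norm(step) < 1e-10:
        break

print("iterative  :", x)
print("closed form:", np.linalg.solve(A, b))
# For gradient *ascent* (finding a maximum) the sign flips: x = x + eta * grad(x).
```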
- Problem Statement
- Goal: Given training samples $(X_1, d_1), \dots, (X_N, d_N)$, minimize the empirical risk $Loss(W) = \frac{1}{N}\sum_i Div(Y_i, d_i)$ w.r.t. W, where $Y_i$ is the network output for $X_i$.
- Typical Network
- MLP
- Layered. Acyclic.
- Input layers + hidden layers + output layers.
- Individual Neurons: $z = \sum_i w_i x_i + b$, $y = f(z)$, where $f$ is the activation function (sketch below).
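A minimal sketch of a single neuron's forward computation, assuming a sigmoid activation; the weights, bias, and input are illustrative values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])     # inputs
w = np.array([0.1, 0.4, -0.3])     # weights
b = 0.2                            # bias

z = w @ x + b                      # affine combination: z = sum_i w_i x_i + b
y = sigmoid(z)                     # activation: y = f(z)
print(z, y)
```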
- Activation functions
- Almost differentiable everywhere; ReLU is the exception. Use a sub-gradient for ReLU at $z = 0$ (any value in $[0, 1]$ works).
- Scalar Activations: $y = f(z)$, applied to each neuron's $z$ individually.
- Vector Activations: $[y_1, \dots, y_L] = f(z_1, \dots, z_L)$. In a layered network, each layer of perceptrons can be viewed as a single vector activation.
- Example: Softmax, $y_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$ (sketch below).
- Difference between scalar and vector activations: with a vector activation, changing one weight influences all of the layer's outputs.
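A sketch of softmax as a vector activation; note how perturbing a single pre-activation (here $z_0$, an arbitrary choice) changes every output, unlike a scalar activation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # subtract the max for numerical stability
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5])
print(softmax(z))                  # every output depends on every input

z[0] += 0.1                        # change only one pre-activation...
print(softmax(z))                  # ...and all outputs change
```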
- Notations
- Input layer is 0th layer.
- Denote the output of the $i$-th perceptron of the $k$-th layer as $y_i^{(k)}$.
- Represent the weight of the connection between the $i$-th unit of the $(k-1)$-th layer and the $j$-th unit of the $k$-th layer as $w_{ij}^{(k)}$.
- The bias to the $j$-th unit of the $k$-th layer is $b_j^{(k)}$.
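A sketch of a forward pass in this notation, with `W[k][i, j]` standing for $w_{ij}^{(k)}$ and `b[k][j]` for $b_j^{(k)}$; the layer sizes, random weights, and the choice of ReLU everywhere are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [3, 4, 2]                          # layer 0 (input), layer 1, layer 2 (output)

# W[k][i, j] = w_ij^(k): i-th unit of layer k-1 -> j-th unit of layer k
W = {k: rng.normal(size=(sizes[k - 1], sizes[k])) for k in range(1, len(sizes))}
b = {k: np.zeros(sizes[k]) for k in range(1, len(sizes))}

def relu(z):
    return np.maximum(z, 0.0)

y = rng.normal(size=sizes[0])              # y^(0) = input X
for k in range(1, len(sizes)):
    z = y @ W[k] + b[k]                    # z_j^(k) = sum_i w_ij^(k) y_i^(k-1) + b_j^(k)
    y = relu(z)                            # y^(k) = f(z^(k))
print(y)                                   # network output
```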
- Input / Output
- Input: X, D-dimensional vectors
- Desired Output: d, L-dimensional vectors
- Actual Output: y, L-dimensional vectors
- Output
- Real values: direct output
- Binary classifier: $\{0, 1\}$ representations
- Output activation (Usually sigmoid)
- Viewed as probability: $Y = P(\text{class} = 1 \mid X)$.
- Multi-class classifier: One-hot representations
- N classes, N binary outputs.
- Use vector activations. Posterior probability vector.
- Divergence
- Needs to be differentiable
- Popular
- Real values: the (scaled) L2 divergence, $Div(Y, d) = \frac{1}{2}\lVert Y - d \rVert^2 = \frac{1}{2}\sum_i (y_i - d_i)^2$.
- Binary classifier: the cross-entropy, $Div(Y, d) = -d \log Y - (1 - d)\log(1 - Y)$.
- Derivative: $\frac{dDiv}{dY} = -\frac{1}{Y}$ if $d = 1$, else $\frac{dDiv}{dY} = \frac{1}{1 - Y}$.
- Note: when $Y = 1$ and $d = 1$ the derivative is not 0 (it is $-1$). This is acceptable because the sigmoid output is restricted to the open interval $(0, 1)$, so $Y$ never actually reaches 1.
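For reference, a one-line derivation of the derivative above (standard calculus, no new assumptions):

```latex
\frac{d\,Div}{dY}
  = \frac{d}{dY}\Bigl[-d\log Y - (1-d)\log(1-Y)\Bigr]
  = -\frac{d}{Y} + \frac{1-d}{1-Y}
  = \begin{cases} -\dfrac{1}{Y} & d = 1 \\[4pt] \dfrac{1}{1-Y} & d = 0 \end{cases}
```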
- Multi-class classifier:
- Basic: $Div(Y, d) = -\sum_i d_i \log y_i = -\log y_c$, where $c$ is the target class (only one $d_i$ is going to be 1).
- Derivative: $\frac{\partial Div}{\partial y_c} = -\frac{1}{y_c}$ for the $c$-th component; 0 for the remaining components.
- Note: for $y_c < 1$ the slope is negative, indicating that increasing $y_c$ will reduce the divergence.
- Label smoothing: set the target output to $1 - \epsilon$ for the target class and $\frac{\epsilon}{K-1}$ for the other classes, with a small $\epsilon > 0$ ($K$ = number of classes; sketch below).
- Derivative: $-\frac{1 - \epsilon}{y_c}$ for the $c$-th component; $-\frac{\epsilon}{(K-1)\,y_i}$ for the remaining components.
- Note: we get a reasonable derivative even for the non-target classes.
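A sketch of the smoothed target and the resulting gradient, under the parametrization assumed above ($1-\epsilon$ for the target class, $\epsilon/(K-1)$ for the rest); the class count, $\epsilon$, and output values are illustrative:

```python
import numpy as np

K, c, eps = 4, 2, 0.1                      # classes, target class, smoothing amount
y = np.array([0.10, 0.20, 0.60, 0.10])     # network output (softmax probabilities)

d = np.full(K, eps / (K - 1))              # smoothed target: eps/(K-1) for non-targets...
d[c] = 1.0 - eps                           # ...and 1 - eps for the target class

div = -np.sum(d * np.log(y))               # Div(Y, d) = -sum_i d_i log y_i
grad = -d / y                              # dDiv/dy_i = -d_i / y_i (nonzero for every class)
print(div, grad)
```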
- Total derivative: the derivative of the divergence w.r.t. the full output vector, $\frac{dDiv(Y, d)}{dY}$.
- Math Recap: the derivative of a scalar w.r.t. a (column) vector is a row vector.
- Derivative: $\frac{dDiv(Y, d)}{dY} = \left[ \frac{\partial Div}{\partial y_1}, \dots, \frac{\partial Div}{\partial y_L} \right]$.
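As a concrete instance, for the (scaled) L2 divergence above the total derivative works out to a row vector (a standard result, stated here for reference):

```latex
\frac{d\,Div(Y,d)}{dY}
  = \left[ \frac{\partial Div}{\partial y_1}, \dots, \frac{\partial Div}{\partial y_L} \right]
  = \left[ y_1 - d_1, \dots, y_L - d_L \right]
  = (Y - d)^{\top}
```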