Recap
- Modification to the original perceptron
- Bias can be viewed as the weight of another constant input of 1
- Activation functions are not necessarily threshold functions
- Design the network
- The structure of the network is feed-forward: no loops. Neuron outputs do not feed back to their inputs, directly or indirectly.
- The parameters of the network: the weights and biases.
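As a concrete picture of "feed-forward, with weights and biases as the parameters", here is a minimal sketch. The layer sizes and the sigmoid activation are made-up choices for illustration only; the point is that each layer consumes only the previous layer's output, so there is no feedback path.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Feed-forward pass: each layer sees only the previous layer's output,
    so there are no loops and no feedback paths."""
    a = x
    for W, b in layers:            # the parameters: one weight matrix and bias vector per layer
        a = sigmoid(W @ a + b)     # the bias acts as the weight of a constant input of 1
    return a

# Made-up sizes for illustration: 3 inputs -> 4 hidden units -> 1 output
rng = np.random.default_rng(0)
layers = [
    (rng.standard_normal((4, 3)), rng.standard_normal(4)),
    (rng.standard_normal((1, 4)), rng.standard_normal(1)),
]
print(forward(np.array([0.5, -1.0, 2.0]), layers))
```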
Learning the Network: Determine these parameters.
Construct by hand: Shape decision boundaries
Automatic estimation of an MLP
- Ideal Target:
- Definition: find the parameters that minimize the total divergence from the target function $g(X)$ over the input domain, $\hat{W} = \arg\min_W \int_X \mathrm{div}\big(f(X; W), g(X)\big)\, dX$, where $f(X; W)$ is the network.
- When $f(X; W)$ has the capacity to exactly represent $g(X)$, the minimum is attained at $f(X; \hat{W}) = g(X)$.
- $\mathrm{div}(\cdot, \cdot)$ is a divergence function (loss function) that goes to zero when $f(X; W) = g(X)$.
- Intuition:
- In other words, minimize the divergence to the target function.
- Practical Problem
- Problem: $g(X)$ is unknown, so the integral over the input domain cannot be computed.
- Solution: Sample the function.
- Assumption: The sampling of $X$ matches the natural distribution of $X$.
- Note: The professor implies the assumption is i.i.d. However, $X$ may not be evenly sampled, and the estimate is effectively weighted toward the range where $X$ appears more frequently.
- Intuition:
- Fit the samples well and then hope the network fits the function well (a small numeric sketch follows this block).
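A small numeric sketch of "sample the function". Everything here is made up for illustration: the target is $g(x) = \sin x$, the model is a deliberately imperfect $f$, and the divergence is squared error. Drawing $X$ from its natural distribution and averaging the divergence over the samples approximates the expected divergence, and the frequently-sampled range dominates the estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

g = np.sin                        # stand-in for the unknown target function g(X)
f = lambda x: 0.9 * np.sin(x)     # a deliberately imperfect model f(X; W)
div = lambda a, b: (a - b) ** 2   # squared-error divergence (one possible choice)

# X drawn from its "natural" distribution (here: standard normal, an assumption)
X = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Sample average of the divergence approximates the expected divergence;
# values of X that occur more often contribute more to the estimate.
print(np.mean(div(f(X), g(X))))
```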
- Methodology:
- A single perceptron (linearly separable):
- Target: the decision boundary $W^T X = 0$: a hyperplane formed by all vectors $X$ that are orthogonal to $W$. (Here $X$ is not the training input but a vector lying on the hyperplane.)
- Method: PLA. For each training instance $(X_i, y_i)$ with $y_i \in \{-1, +1\}$: if $\mathrm{sign}(W^T X_i) \neq y_i$, then $W \leftarrow W + y_i X_i$. (A minimal code sketch follows this block.)
- Theory guarantee: finds a separating hyperplane in a finite number of updates (at most $R^2 / \gamma^2$) if the data is linearly separable.
- $R$: the length of the longest input vector; $\gamma$: the best-case margin.
- Intuition:
- If there is a single point labeled +1 / -1, the ideal weight vector points toward / away from that point.
- If there are more points, keep combining (adding up) the points' contributions.
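A minimal PLA sketch under the usual conventions (labels in $\{-1, +1\}$, bias folded in as the weight of a constant-1 input; the toy data below is made up): every time a sample is misclassified, add $y_i X_i$ to the weights, and stop once a full pass makes no update.

```python
import numpy as np

def pla(X, y, max_epochs=1000):
    """Perceptron Learning Algorithm.
    X: (N, d) inputs with a constant-1 column appended for the bias.
    y: (N,) labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        updated = False
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:   # misclassified (or exactly on the boundary)
                w += yi * xi         # update: W <- W + y_i * X_i
                updated = True
        if not updated:              # a full pass with no mistakes: done
            break
    return w

# Toy linearly separable data: label is +1 iff x1 + x2 > 1.5
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 2]], dtype=float)
y = np.array([-1, -1, -1, 1, 1])
Xb = np.hstack([X, np.ones((len(X), 1))])   # fold the bias in as a constant-1 input
print(pla(Xb, y))
```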
- Multiple perceptrons (not linearly separable):
- PLA: cannot be used to learn an MLP; assigning intermediate labels has exponential complexity (it would require a known input-output relation for every perceptron).
- Greedy algorithms: Adaline and Madaline.
- For each perceptron, flip its output on one instance at a time, keeping the flip only if it reduces the overall error.
- Continuous value activation function:
- Problem
- Individual neurons' weights can change significantly without changing the overall error -> the error is effectively non-differentiable w.r.t. the weights (flat regions with sudden jumps).
- Real-life data are not linearly separable.
- Solution:
- Continuous-valued activation function.
- Popular choice: Sigmoid
- Intuition: sweep $X$ and compute the average value of $Y$ in a small interval around each $X$; the estimated probability of $Y = 1$ traces out a sigmoid curve (simulated below).
- The logistic regression model: $P(Y = 1 \mid X) = \frac{1}{1 + e^{-(w_0 + \sum_i w_i X_i)}} = \sigma\big(w_0 + W^T X\big)$.
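A quick simulation of that intuition. The data-generating process is an assumption made up for illustration ($Y = 1$ with probability $\sigma(2x + 0.5)$): binning $X$ and averaging $Y$ within each bin recovers an S-shaped curve, which is exactly the shape the logistic regression model posits for $P(Y = 1 \mid X)$.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Assumed data-generating process (made up): Y = 1 with probability sigma(2x + 0.5)
rng = np.random.default_rng(0)
x = rng.uniform(-4, 4, size=200_000)
y = (rng.uniform(size=x.size) < sigmoid(2.0 * x + 0.5)).astype(float)

# Average Y over small intervals of X: the averages trace out a sigmoid curve.
edges = np.linspace(-4, 4, 17)
for lo, hi in zip(edges[:-1], edges[1:]):
    in_bin = (x >= lo) & (x < hi)
    print(f"x in [{lo:+.1f}, {hi:+.1f}):  mean(Y) = {y[in_bin].mean():.3f}")
```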
- As an activation function, the sigmoid is differentiable: $y = \sigma(z) = \frac{1}{1 + e^{-z}}$, with $\frac{dy}{dz} = \sigma(z)\big(1 - \sigma(z)\big) = y(1 - y)$ (checked numerically below).
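A small numerical check of that derivative identity, comparing $\sigma(z)(1 - \sigma(z))$ against a central finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
analytic = sigmoid(z) * (1.0 - sigmoid(z))                    # y * (1 - y)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)   # central finite difference
print(np.max(np.abs(analytic - numeric)))                     # tiny, so the identity holds
```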
- The entire network is differentiable as well.
- Minimize the expected error:
- Emphasize the more frequent values of $X$.
- Empirical Risk Minimization:
- Ideal: minimize the expected divergence, $E\big[\mathrm{div}(f(X; W), g(X))\big] = \sum_X \mathrm{div}\big(f(X; W), g(X)\big)\, P(X)$ in the discrete case (each $X$ weighted by its probability of occurring).
- Empirical estimate:
- $\text{Loss}(W) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{div}\big(f(X_i; W), d_i\big)$
- Problem Statement:
- Given a training set of input-output pairs $(X_1, d_1), (X_2, d_2), \ldots, (X_N, d_N)$.
- Minimize the loss function $\text{Loss}(W) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{div}\big(f(X_i; W), d_i\big)$ w.r.t. $W$: the problem of function minimization. (A minimal sketch of this loss follows.)
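A minimal sketch of the empirical risk for a single logistic perceptron. The cross-entropy divergence and the toy training pairs are assumptions for illustration; any differentiable divergence fits the framework above.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def empirical_risk(W, b, X, d):
    """Loss(W) = (1/N) * sum_i div(f(X_i; W), d_i) for a single logistic perceptron.
    The divergence here is cross-entropy, one common differentiable choice."""
    y = sigmoid(X @ W + b)                 # network outputs f(X_i; W)
    eps = 1e-12                            # numerical guard against log(0)
    return -np.mean(d * np.log(y + eps) + (1 - d) * np.log(1 - y + eps))

# Made-up training set of input-output pairs (X_i, d_i)
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
d = np.array([0.0, 0.0, 1.0, 1.0])
print(empirical_risk(np.array([1.0, 1.0]), -1.5, X, d))
```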
- Summary:
- Threshold-activation requires solving a hard combinatorial-optimization problem
- Continuous activation functions with non-zero derivatives enable us to estimate the network parameters.
- A logistic-activation perceptron computes the a posteriori probability of the output given the input.
- Define a differentiable divergence between the output of the network and the desired output (all activation functions and the divergence need to be differentiable).
- Optimize network parameters to minimize this error. (Empirical Risk Minimization)
- A Note on Derivatives
- Multivariate scalar function derivatives
- Conditions for a minimum:
- Single variable: first derivative equals 0 and second derivative is positive.
- Multiple variables: the gradient equals 0 and the Hessian (matrix of second derivatives) is positive definite, i.e., the second derivative is positive along every direction. (A quick numeric check follows.)
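A tiny numeric illustration of those conditions on a made-up two-variable function $f(w) = (w_1 - 1)^2 + 2(w_2 + 0.5)^2$: at its minimum the gradient is zero and the Hessian is positive definite.

```python
import numpy as np

# Example function: f(w) = (w1 - 1)^2 + 2 * (w2 + 0.5)^2, minimized at w = (1, -0.5)
def grad(w):
    return np.array([2.0 * (w[0] - 1.0), 4.0 * (w[1] + 0.5)])   # first derivatives

hessian = np.array([[2.0, 0.0],
                    [0.0, 4.0]])                                 # second derivatives

w_star = np.array([1.0, -0.5])
print(grad(w_star))                      # [0. 0.]  -> first-order condition holds
print(np.linalg.eigvalsh(hessian) > 0)   # [ True  True ] -> Hessian is positive definite
```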