- Recap
- Location-independent pattern classification is best performed by scanning for the target pattern.
- Scanning with a full network is equivalent to scanning with individual neurons arranged hierarchically.
- Deformations in the input can be handled by max pooling.
- For 2-D (or higher dimensional) scans, the structure is called a convnet.
- For a 1-D scan along time, the structure is called a time-delay neural network.
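The scanning idea in the recap can be sketched as a 1-D scan with shared weights. This is a toy illustration; the weights and input signal below are made up for the example:

```python
# Toy 1-D scan: one neuron with shared weights slides over the input.
weights = [1.0, -1.0, 1.0]          # shared filter, length 3
signal  = [0, 1, 0, 1, 0, 0, 1, 0]  # input sequence

# Scan: apply the same neuron at every position (stride 1).
responses = [
    sum(w * x for w, x in zip(weights, signal[t:t + len(weights)]))
    for t in range(len(signal) - len(weights) + 1)
]

# Location-independent detection: take the max over all positions.
detected = max(responses)
print(responses, detected)
```

Because the same weights are reused at every position, the detection result does not depend on where in the input the pattern occurs.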
- Background of CNN
- Hubel and Wiesel 1959
- First study on neural correlates of vision
- Receptive fields in cat striate cortex
- Receptive fields: restricted retinal areas which, on illumination, influence the firing of single cortical units; subdivided into excitatory and inhibitory regions.
- Each neuron responds to a specific orientation of input light.
- Structure:
- S-Cell: linear arrangement.
- C-Cell: responds to the largest output from a bank of S-Cells.
- S-Cell and C-Cell layers alternate, repeated hierarchically.
- Kunihiko Fukushima 1980
- Position invariance of input: your grandmother cell fires no matter the location in your field of vision.
- Neocognitron
- Visual system consists of a hierarchy of modules, each comprising a layer of S-cells followed by a layer of C-cells.
- Each SC-layer detects a feature.
- Receptive fields at each layer.
- Learning
- Unsupervised learning
- Randomly initialize S cells, perform Hebbian learning updates
- Adding Supervision
- CNN: LeCun. 1998. LeNet. Single filter.
- CNN (Convolutional Neural Networks)
- Structure: Convolution, Down Sample and MLP
- Multiple convolutional layers can cluster together before the down sample layer.
- Convolutional layer and MLP layer are learnable.
- What is convolution
- The set of scanning weights is called a filter.
- Multiple filters scan the input.
- Stack multiple filters into a cube.
- Notation:
- s is the id of the convolution layer.
- p is the id of the filter.
- i, j are the 2-D axis of the input.
- k,l are the 2-D axis of the filter.
- Notation:
- Size shrink
- Image size N × N.
- Filter size M × M.
- Stride S.
- Output size ⌊(N − M)/S⌋ + 1 per side, i.e. smaller than the input image.
- To avoid size shrink – pad zeros
- Pad the input image all around.
- Convolution is affine combination + activation.
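As a concrete sketch of the scan described above, here is a naive 2-D convolution in plain Python with zero padding and stride (the activation is omitted, and the image and kernel values are illustrative):

```python
def conv2d(image, kernel, stride=1, pad=0):
    """Naive 2-D convolution: affine combination at each scanned position."""
    n = len(image)
    m = len(kernel)
    # Zero-pad the input all around to control the output size.
    p = n + 2 * pad
    padded = [[0.0] * p for _ in range(p)]
    for i in range(n):
        for j in range(n):
            padded[i + pad][j + pad] = image[i][j]
    out_size = (p - m) // stride + 1   # floor((N + 2*pad - M)/S) + 1
    out = [[0.0] * out_size for _ in range(out_size)]
    for i in range(out_size):
        for j in range(out_size):
            out[i][j] = sum(
                kernel[k][l] * padded[i * stride + k][j * stride + l]
                for k in range(m) for l in range(m)
            )
    return out

img = [[1, 0, 0, 1],
       [0, 1, 1, 0],
       [0, 1, 1, 0],
       [1, 0, 0, 1]]
box = [[1, 1], [1, 1]]                 # 2x2 summing filter
print(conv2d(img, box, stride=2))      # 4x4 input, 2x2 filter, stride 2 -> [[2, 2], [2, 2]]
```

With pad = 0 the output shrinks as in the size formula; padding the image restores the original size.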
- Down Sampling
- Max pooling
- Typical stride will be the same as the pooling filter width.
- An N × N picture compressed by a P × P pooling filter with stride D results in an output map of (N/D) × (N/D).
- Alternatives: mean pooling, P-norm, MLP.
- The goal is to combine features and account for jitter.
- Down sampling can be avoided
- Max pooling
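A minimal max-pooling sketch in plain Python, with stride equal to the pooling width so that an N × N map becomes (N/D) × (N/D) (the map values are illustrative):

```python
def max_pool(fmap, p, stride):
    """Down-sample by taking the max over each p x p block."""
    n = len(fmap)
    out_size = (n - p) // stride + 1
    return [
        [
            max(
                fmap[i * stride + k][j * stride + l]
                for k in range(p) for l in range(p)
            )
            for j in range(out_size)
        ]
        for i in range(out_size)
    ]

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 1, 5, 2],
        [2, 0, 1, 3]]
print(max_pool(fmap, p=2, stride=2))   # 4x4 map, 2x2 pooling, stride 2 -> [[4, 2], [2, 5]]
```

Swapping `max` for a mean gives mean pooling; the block structure stays the same.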
- Overall
- Typical image classification task
- Input: color images, RGB, i.e. 3 input maps.
- Convolution:
- K filters in total.
- Each filter is M × M × 3.
- Typically K is a power of 2. Filters are typically 3 × 3, 5 × 5, or even 7 × 7.
- Typically stride is 1 or 2.
- Total number of parameters: K(3M² + 1) (the 1 is the bias).
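The parameter count can be checked with a one-liner; the example numbers (K = 64 filters of size 3 × 3 over 3 input channels) are illustrative:

```python
def conv_params(K, M, in_channels=3):
    """Each of K filters has M*M*in_channels weights plus one bias."""
    return K * (in_channels * M * M + 1)

print(conv_params(K=64, M=3))   # 64 * (3*9 + 1) = 64 * 28 = 1792
```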
- Pooling:
- Choose: size of pooling block P × P, and pooling stride D.
- Typically D is the same as P.
- Choices: Max, Mean, MLP pooling?
- Backpropagation: the derivative of max pooling is problematic; we need to record which input was the actual max, since only that one gets the gradient passed through.
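The argmax issue in backpropagation can be sketched as follows: the forward pass records which element of the block was the max, and the backward pass routes the incoming gradient only to that position (a toy sketch for a single block, not the full conv-layer backprop):

```python
def max_pool_forward(block):
    """Return the max of a pooling block and the (row, col) of the winner."""
    best, argmax = None, None
    for k, row in enumerate(block):
        for l, v in enumerate(row):
            if best is None or v > best:
                best, argmax = v, (k, l)
    return best, argmax

def max_pool_backward(grad_out, argmax, p):
    """Only the recorded max position receives the incoming gradient."""
    grad_in = [[0.0] * p for _ in range(p)]
    k, l = argmax
    grad_in[k][l] = grad_out
    return grad_in

block = [[1, 5], [3, 2]]
y, idx = max_pool_forward(block)         # y = 5, winner at (0, 1)
print(max_pool_backward(1.0, idx, p=2))  # -> [[0.0, 1.0], [0.0, 0.0]]
```

Mean pooling avoids this: its derivative spreads the gradient evenly over the block.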
- Terminology:
- Filters are called kernels.
- Output of individual filters are called channels.
- The number of filters is called the number of channels.
- The size of the layers
- Convolutional layer maintains the size of the image.
- Convolutional layer may increase the number of maps from the previous layer.
- Pooling layer decreases the size of the maps by a factor of D
- Filters within a layer must be the same size, but sizes may vary between layers.
- In general, the number of convolutional filters increases with layers.
- Parameters
- Number of convolutional and downsampling layers
- Convolutional layers
- Downsampling layers
- MLP
- The training procedure is similar to that of an MLP: treat the network as a flat multi-layer perceptron with the restriction of shared parameters.
- Gradient descent
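Because the filter is shared across positions, its gradient accumulates the per-position gradients, exactly as in a flat MLP with tied weights. A 1-D toy sketch (the weights, input, and upstream gradients are illustrative):

```python
# 1-D toy: one shared weight vector scanned over the input (no activation).
w = [0.5, -0.5]
x = [1.0, 2.0, 3.0, 4.0]

# Forward scan: z[t] = sum_k w[k] * x[t + k]
z = [sum(w[k] * x[t + k] for k in range(len(w)))
     for t in range(len(x) - len(w) + 1)]

# Suppose dLoss/dz[t] = 1 at every position; since w is shared, dLoss/dw[k]
# sums the contribution x[t + k] from every scan position t.
dz = [1.0] * len(z)
dw = [sum(dz[t] * x[t + k] for t in range(len(z)))
      for k in range(len(w))]
print(dw)   # dw[0] = 1+2+3 = 6.0, dw[1] = 2+3+4 = 9.0
```

Gradient descent then updates the single shared copy of `w` with this accumulated gradient.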