- Recap
- Location-independent pattern classification is best performed by scanning for the target pattern.
- Scanning with a full network is equivalent to scanning with individual neurons arranged hierarchically.
- Deformations in the input can be handled by max pooling.
- For 2-D (or higher dimensional) scans, the structure is called a convnet.
- For a 1-D scan along time, the structure is called a time-delay neural network.
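The scanning idea in the recap can be sketched as a 1-D scan with shared weights. This is a toy illustration; the weights and input signal below are made up for the example:

```python
# Toy 1-D scan: one neuron with shared weights slides over the input.
weights = [1.0, -1.0, 1.0]          # shared filter, length 3
signal  = [0, 1, 0, 1, 0, 0, 1, 0]  # input sequence

# Scan: apply the same neuron at every position (stride 1).
responses = [
    sum(w * x for w, x in zip(weights, signal[t:t + len(weights)]))
    for t in range(len(signal) - len(weights) + 1)
]

# Location-independent detection: take the max over all positions.
detected = max(responses)
print(responses, detected)
```

Because the same weights are reused at every position, the detection result does not depend on where in the input the pattern occurs.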
- Background of CNN
- Hubel and Wiesel 1959
- First study on neural correlates of vision
- Receptive fields in cat striate cortex
- Receptive fields: restricted retinal areas which, on illumination, influence the firing of single cortical units; subdivided into excitatory and inhibitory regions.
- Each neuron responds to a specific orientation of input light.
- Structure:
- S-Cell: linear arrangement.
- C-Cell: responds to the largest output from a bank of S-Cells.
- S-Cell and C-Cell layers alternate, repeated hierarchically.
- Kunihiko Fukushima 1980
- Position invariance of input: your grandmother cell fires no matter the location in your field of vision.
- Neocognitron
- Visual system consists of a hierarchy of modules, each comprising a layer of S-cells followed by a layer of C-cells.
- Each SC-layer detects a feature.
- Receptive fields at each layer.
- Learning
- Unsupervised learning
- Randomly initialize S cells, perform Hebbian learning updates
- Adding Supervision
- CNN: LeCun. 1998. LeNet. Single filter.
- CNN (Convolutional Neural Networks)
- Structure: Convolution, Down Sample and MLP
- Multiple convolutional layers can cluster together before the down sample layer.
- Convolutional layer and MLP layer are learnable.
- What is convolution
- The set of scanning weights is called a filter.
- Multiple filters scan the input.
- Stack multiple filters into a cube.
- Notation:
- s is the id of the convolution layer.
- p is the id of the filter.
- i, j are the 2-D axis of the input.
- k,l are the 2-D axis of the filter.
- Notation:
- Size shrink
- Image size N × N.
- Filter size M × M.
- Stride S.
- Output size ⌊(N − M)/S⌋ + 1 per side, i.e. smaller than the input image.
- To avoid size shrink – pad zeros
- Pad the input image all around.
- Convolution is affine combination + activation.
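As a concrete sketch of the scan described above, here is a naive 2-D convolution in plain Python with zero padding and stride (the activation is omitted, and the image and kernel values are illustrative):

```python
def conv2d(image, kernel, stride=1, pad=0):
    """Naive 2-D convolution: affine combination at each scanned position."""
    n = len(image)
    m = len(kernel)
    # Zero-pad the input all around to control the output size.
    p = n + 2 * pad
    padded = [[0.0] * p for _ in range(p)]
    for i in range(n):
        for j in range(n):
            padded[i + pad][j + pad] = image[i][j]
    out_size = (p - m) // stride + 1   # floor((N + 2*pad - M)/S) + 1
    out = [[0.0] * out_size for _ in range(out_size)]
    for i in range(out_size):
        for j in range(out_size):
            out[i][j] = sum(
                kernel[k][l] * padded[i * stride + k][j * stride + l]
                for k in range(m) for l in range(m)
            )
    return out

img = [[1, 0, 0, 1],
       [0, 1, 1, 0],
       [0, 1, 1, 0],
       [1, 0, 0, 1]]
box = [[1, 1], [1, 1]]                 # 2x2 summing filter
print(conv2d(img, box, stride=2))      # 4x4 input, 2x2 filter, stride 2 -> [[2, 2], [2, 2]]
```

With pad = 0 the output shrinks as in the size formula; padding the image restores the original size.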
- Down Sampling
- Max pooling
- Typical stride will be the same as the pooling filter width.
- An N × N picture compressed by a P × P pooling filter with stride D results in an output map of (N/D) × (N/D).
- Alternatives: mean pooling, P-norm, MLP.
- The goal is to combine features and account for jitter.
- Down sampling can be avoided
- Max pooling
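A minimal max-pooling sketch in plain Python, with stride equal to the pooling width so that an N × N map becomes (N/D) × (N/D) (the map values are illustrative):

```python
def max_pool(fmap, p, stride):
    """Down-sample by taking the max over each p x p block."""
    n = len(fmap)
    out_size = (n - p) // stride + 1
    return [
        [
            max(
                fmap[i * stride + k][j * stride + l]
                for k in range(p) for l in range(p)
            )
            for j in range(out_size)
        ]
        for i in range(out_size)
    ]

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 1, 5, 2],
        [2, 0, 1, 3]]
print(max_pool(fmap, p=2, stride=2))   # 4x4 map, 2x2 pooling, stride 2 -> [[4, 2], [2, 5]]
```

Swapping `max` for a mean gives mean pooling; the block structure stays the same.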
- Overall
- Typical image classification task
- Input: color images, RGB, i.e. 3 input maps.
- Convolution:
- K filters in total.
- Each filter is M × M × 3.
- Typically K is a power of 2. Filters are typically 3 × 3, 5 × 5, or even 7 × 7.
- Typically stride is 1 or 2.
- Total number of parameters: K(3M² + 1) (the 1 is the bias).
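The parameter count can be checked with a one-liner; the example numbers (K = 64 filters of size 3 × 3 over 3 input channels) are illustrative:

```python
def conv_params(K, M, in_channels=3):
    """Each of K filters has M*M*in_channels weights plus one bias."""
    return K * (in_channels * M * M + 1)

print(conv_params(K=64, M=3))   # 64 * (3*9 + 1) = 64 * 28 = 1792
```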
- Pooling:
- Choose: size of pooling block P × P, and pooling stride D.
- Typically D is the same as P.
- Choices: Max, Mean, MLP pooling?
- Backpropagation: the derivative of max pooling is problematic; we need to record which input was the actual max, since only that one gets the gradient passed through.
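The argmax issue in backpropagation can be sketched as follows: the forward pass records which element of the block was the max, and the backward pass routes the incoming gradient only to that position (a toy sketch for a single block, not the full conv-layer backprop):

```python
def max_pool_forward(block):
    """Return the max of a pooling block and the (row, col) of the winner."""
    best, argmax = None, None
    for k, row in enumerate(block):
        for l, v in enumerate(row):
            if best is None or v > best:
                best, argmax = v, (k, l)
    return best, argmax

def max_pool_backward(grad_out, argmax, p):
    """Only the recorded max position receives the incoming gradient."""
    grad_in = [[0.0] * p for _ in range(p)]
    k, l = argmax
    grad_in[k][l] = grad_out
    return grad_in

block = [[1, 5], [3, 2]]
y, idx = max_pool_forward(block)         # y = 5, winner at (0, 1)
print(max_pool_backward(1.0, idx, p=2))  # -> [[0.0, 1.0], [0.0, 0.0]]
```

Mean pooling avoids this: its derivative spreads the gradient evenly over the block.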
- Terminology:
- Filters are called kernels.
- Output of individual filters are called channels.
- The number of filters is called the number of channels.
- The size of the layers
- Convolutional layer maintains the size of the image.
- Convolutional layer may increase the number of maps from the previous layer.
- Pooling layer decreases the size of the maps by a factor of D
- Filters within a layer must be the same size, but sizes may vary between layers.
- In general, the number of convolutional filters increases with layers.
- Parameters
- Number of convolutional and downsampling layers
- Convolutional layers
- Downsampling layers
- MLP
- The training procedure is similar to that of an MLP: treat the network as a flat multi-layer perceptron with the restriction of shared parameters.
- Gradient descent
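Because the filter is shared across positions, its gradient accumulates the per-position gradients, exactly as in a flat MLP with tied weights. A 1-D toy sketch (the weights, input, and upstream gradients are illustrative):

```python
# 1-D toy: one shared weight vector scanned over the input (no activation).
w = [0.5, -0.5]
x = [1.0, 2.0, 3.0, 4.0]

# Forward scan: z[t] = sum_k w[k] * x[t + k]
z = [sum(w[k] * x[t + k] for k in range(len(w)))
     for t in range(len(x) - len(w) + 1)]

# Suppose dLoss/dz[t] = 1 at every position; since w is shared, dLoss/dw[k]
# sums the contribution x[t + k] from every scan position t.
dz = [1.0] * len(z)
dw = [sum(dz[t] * x[t + k] for t in range(len(z)))
      for k in range(len(w))]
print(dw)   # dw[0] = 1+2+3 = 6.0, dw[1] = 2+3+4 = 9.0
```

Gradient descent then updates the single shared copy of `w` with this accumulated gradient.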