
  1. Background
    1. Problem: position-independent pattern classification.
      1. The need for shift invariance.
    2. Solution:
      1. Scan. Apply multiple identical MLPs at different locations; combine the outputs of all of them with a max, a perceptron, or another MLP. (A small numpy sketch follows at the end of this Background section.)
        1. 1-D scanning for sound
        2. 2-D scanning for images
        3. 3-D and higher-dimensional scans for higher dimensional data
      2. Scanning is equivalent to composing a large network with repeating subnets.
        1. The large network has shared subnets.
      3. A gigantic shared-parameter network. Training with shared parameters.
        1. Backpropagation rules must be modified: gradients from all positions that use the same shared parameter are summed.
        2. Reorder the computation: loop over layers -> over positions in the 1-D/2-D data -> over neurons.
        3. Distributing the scan.
          1. First layer looks for smaller patterns.
          2. The second layer looks for larger patterns; a larger pattern consists of a grid of blocks.
          3. Still a shared-parameter network. The first layer is doubly shared (duplicated x duplicated).
        4. Scanning for a pattern. The operation is called convolution.
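      A minimal numpy sketch of the 1-D scanning idea (the names and sizes here, such as shared_mlp, scan, K = 8, H = 16, are illustrative assumptions, not from the notes): one MLP with a single set of weights is evaluated at every position, and a max over positions gives a position-independent score. Training such a scan is exactly training one large network with shared subnets, where a shared weight's gradient is the sum of its per-position gradients.

        import numpy as np

        rng = np.random.default_rng(0)

        # Made-up sizes: a window of K = 8 samples, H = 16 hidden units, one output.
        K, H = 8, 16
        W1, b1 = rng.standard_normal((H, K)) * 0.1, np.zeros(H)
        W2, b2 = rng.standard_normal(H) * 0.1, 0.0

        def shared_mlp(window):
            # The same two-layer MLP is reused at every position (shared W1, b1, W2, b2).
            h = np.tanh(W1 @ window + b1)
            return W2 @ h + b2

        def scan(x):
            # Evaluate the shared MLP at every position of a 1-D signal,
            # then take the max over positions -> a shift-invariant score.
            # During training, the gradient for W1 (etc.) is the SUM of its
            # gradients over all positions, because the parameter is shared.
            scores = [shared_mlp(x[t:t + K]) for t in range(len(x) - K + 1)]
            return max(scores)

        x = rng.standard_normal(100)   # e.g. a short stretch of audio samples
        print(scan(x))                 # the score is unchanged if the pattern shifts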
  2. Convolutional neural network
    1. Vector notation: Weight W(l, j) is now a 3D D_{l-1}\times K_l\times K_l tensor.
      1. Notation:
        1. l is the id of the current layer.
        2. j is the id of the neuron in the current layer.
        3. D_{l-1} is the number of neurons in the previous layer.
        4. K_l is the side length of the block (the filter spans K_l \times K_l positions).
      2. Pseudocode
        • Y(0) = Image
        • for l = 1:L # layers operate on the vector at (x, y)
          • for j = 1:D_l # neurons (filters) in layer l
            • for x = 1:W_{l-1} - K_l + 1
              • for y = 1:H_{l-1} - K_l + 1
                • segment = Y(l-1, :, x:x+K_l-1, y:y+K_l-1) # 3D tensor
                • z(l, j, x, y) = W(l, j) . segment + b(l, j) # tensor inner product, plus bias
                • Y(l, j, x, y) = activation(z(l, j, x, y))
        • Y = Softmax(Y(L))
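      A minimal, runnable numpy transcription of the pseudocode above. This is a sketch: the ReLU activation, the softmax over the flattened last layer, and all sizes in the toy example are assumptions, since the notes leave the activation and the final stage unspecified.

        import numpy as np

        def relu(z):
            return np.maximum(z, 0.0)

        def softmax(z):
            e = np.exp(z - z.max())
            return e / e.sum()

        def cnn_forward(image, weights, biases):
            """Direct numpy transcription of the pseudocode.
            image:   array of shape (D0, H0, W0)  -- channels x height x width
            weights: list where weights[l] has shape (D_l, D_{l-1}, K_l, K_l)
            biases:  list where biases[l] has shape (D_l,)
            """
            Y = image
            for W, b in zip(weights, biases):
                D_l, D_prev, K, _ = W.shape
                H_out = Y.shape[1] - K + 1
                W_out = Y.shape[2] - K + 1
                Z = np.empty((D_l, H_out, W_out))
                for j in range(D_l):                  # neurons (filters)
                    for x in range(H_out):            # x indexes rows here
                        for y in range(W_out):        # y indexes columns
                            segment = Y[:, x:x + K, y:y + K]            # 3D tensor
                            Z[j, x, y] = np.sum(W[j] * segment) + b[j]  # tensor inner product
                Y = relu(Z)
            return softmax(Y.ravel())   # final softmax, standing in for the closing MLP

        # Toy example with made-up sizes: a 1-channel 8x8 "image", two conv layers.
        rng = np.random.default_rng(0)
        image = rng.standard_normal((1, 8, 8))
        weights = [rng.standard_normal((4, 1, 3, 3)) * 0.1,
                   rng.standard_normal((2, 4, 3, 3)) * 0.1]
        biases = [np.zeros(4), np.zeros(2)]
        print(cnn_forward(image, weights, biases).shape)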
    2. Why distribute
      1. Too many parameters otherwise.
      2. Without distributing:
        1. (K^2 + 1)N_1 weights in first layer.
        2. (N_1 + 1)N_2 weights in second layer.
        3. Total parameters O(K^2N_1 + N_1 N_2 +...).
      3. Distributing the representation over two layers (worked numbers after this list):
        1. Total parameters O(L_1^2N_1 + L_2^2N_2 + (\frac{K}{L_1L_2})^2N_2N_3 + ...).
      4. Rationale:
        1. Distribution forces lower layers to learn localized patterns, which generalizes better.
        2. Number of parameters: a large reduction, plus significant gains from shared computation.
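      A back-of-the-envelope check of the two counts above. The concrete numbers (K = 32, L1 = L2 = 4, N1 = 100, N2 = 50, N3 = 10) are made up for illustration, and the distributed count follows the O(.) expression in the notes literally, ignoring biases.

        K = 32                     # the full pattern is K x K pixels
        N1, N2, N3 = 100, 50, 10   # neurons per layer (made-up numbers)

        # Without distributing: an MLP that sees the whole K x K block at once.
        undistributed = (K**2 + 1) * N1 + (N1 + 1) * N2                          # 107550

        # Distributing over two scanning layers with small L1 x L1 and L2 x L2 blocks,
        # following the O(.) expression above (biases ignored).
        L1, L2 = 4, 4
        distributed = L1**2 * N1 + L2**2 * N2 + (K // (L1 * L2))**2 * N2 * N3    # 4400

        print(undistributed, distributed)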
    3. Terminology
      1. The first-layer scanning neurons (their shared weight blocks) – filters.
      2. The region of the input a neuron responds to – its receptive field.
    4. Modifications
      1. Shifting by more than one pixel – stride.
      2. Accounting for jitter – max pooling (a small numpy sketch follows this list).
        1. Each unit detects a component of the pattern, e.g. petals for flowers.
        2. Only meaningful when the stride is > 1; it down-samples the map.
      3. Higher layers: in reality we can have many layers of convolution, each followed by max pooling, before the final MLP. This is the convolutional neural network (CNN).
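      A minimal numpy sketch of max pooling with a stride (the 2x2 window and stride of 2 are illustrative defaults, not taken from the notes): taking the max inside each window tolerates small jitter of a detected component, and a stride greater than 1 down-samples the map.

        import numpy as np

        def max_pool(Y, size=2, stride=2):
            # Max pooling over a (channels, height, width) map.
            D, H, W = Y.shape
            H_out = (H - size) // stride + 1
            W_out = (W - size) // stride + 1
            out = np.empty((D, H_out, W_out))
            for x in range(H_out):
                for y in range(W_out):
                    patch = Y[:, x * stride:x * stride + size, y * stride:y * stride + size]
                    out[:, x, y] = patch.max(axis=(1, 2))   # tolerate jitter inside each window
            return out

        Y = np.arange(2 * 6 * 6, dtype=float).reshape(2, 6, 6)
        print(max_pool(Y).shape)    # (2, 3, 3): down-sampled by the stride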
