
  1. Background
    1. Problem: position-independent pattern classification.
      1. The need for shift invariance.
    2. Solution:
      1. Scan. Apply multiple identical MLPs at different locations; combine the outputs of all of them with a max, a perceptron, or another MLP. (A small numpy sketch follows at the end of this Background section.)
        1. 1-D scanning for sound
        2. 2-D scanning for images
        3. 3-D and higher-dimensional scans for higher dimensional data
      2. Scanning is equivalent to composing a large network with repeating subnets.
        1. The large network has shared subnets.
      3. A gigantic shared-parameter network. Training with shared parameters.
        1. Backpropagation rules must be modified: gradients from all positions that use the same shared parameter are summed.
        2. Reorder the computation: loop over layers -> over positions in the 1-D/2-D data -> over neurons.
        3. Distributing the scan.
          1. First layer looks for smaller patterns.
          2. The second layer looks for larger patterns; a larger pattern consists of a grid of blocks.
          3. Still a shared-parameter network. The first layer is doubly shared (duplicated x duplicated).
        4. Scanning for a pattern. The operation is called convolution.
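      A minimal numpy sketch of the 1-D scanning idea (the names and sizes here, such as shared_mlp, scan, K = 8, H = 16, are illustrative assumptions, not from the notes): one MLP with a single set of weights is evaluated at every position, and a max over positions gives a position-independent score. Training such a scan is exactly training one large network with shared subnets, where a shared weight's gradient is the sum of its per-position gradients.

        import numpy as np

        rng = np.random.default_rng(0)

        # Made-up sizes: a window of K = 8 samples, H = 16 hidden units, one output.
        K, H = 8, 16
        W1, b1 = rng.standard_normal((H, K)) * 0.1, np.zeros(H)
        W2, b2 = rng.standard_normal(H) * 0.1, 0.0

        def shared_mlp(window):
            # The same two-layer MLP is reused at every position (shared W1, b1, W2, b2).
            h = np.tanh(W1 @ window + b1)
            return W2 @ h + b2

        def scan(x):
            # Evaluate the shared MLP at every position of a 1-D signal,
            # then take the max over positions -> a shift-invariant score.
            # During training, the gradient for W1 (etc.) is the SUM of its
            # gradients over all positions, because the parameter is shared.
            scores = [shared_mlp(x[t:t + K]) for t in range(len(x) - K + 1)]
            return max(scores)

        x = rng.standard_normal(100)   # e.g. a short stretch of audio samples
        print(scan(x))                 # the score is unchanged if the pattern shifts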
  2. Convolutional neural network
    1. Vector notation: Weight W(l, j) is now a 3D D_{l-1}\times K_l\times K_l tensor.
      1. Notation:
        1. l is the id of the current layer.
        2. j is the id of the neuron in the current layer.
        3. D_{l-1} is the number of neurons in the previous layer.
        4. K_l is the side length of the block (the filter spans K_l \times K_l positions).
      2. Pseudocode
        • Y(0) = Image
        • for l = 1:L # layers operate on the vector at (x, y)
          • for j = 1:D_l # neurons (filters) in layer l
            • for x = 1:W_{l-1} - K_l + 1
              • for y = 1:H_{l-1} - K_l + 1
                • segment = Y(l-1, :, x:x+K_l-1, y:y+K_l-1) # 3D tensor
                • z(l, j, x, y) = W(l, j) . segment + b(l, j) # tensor inner product, plus bias
                • Y(l, j, x, y) = activation(z(l, j, x, y))
        • Y = Softmax(Y(L))
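      A minimal, runnable numpy transcription of the pseudocode above. This is a sketch: the ReLU activation, the softmax over the flattened last layer, and all sizes in the toy example are assumptions, since the notes leave the activation and the final stage unspecified.

        import numpy as np

        def relu(z):
            return np.maximum(z, 0.0)

        def softmax(z):
            e = np.exp(z - z.max())
            return e / e.sum()

        def cnn_forward(image, weights, biases):
            """Direct numpy transcription of the pseudocode.
            image:   array of shape (D0, H0, W0)  -- channels x height x width
            weights: list where weights[l] has shape (D_l, D_{l-1}, K_l, K_l)
            biases:  list where biases[l] has shape (D_l,)
            """
            Y = image
            for W, b in zip(weights, biases):
                D_l, D_prev, K, _ = W.shape
                H_out = Y.shape[1] - K + 1
                W_out = Y.shape[2] - K + 1
                Z = np.empty((D_l, H_out, W_out))
                for j in range(D_l):                  # neurons (filters)
                    for x in range(H_out):            # x indexes rows here
                        for y in range(W_out):        # y indexes columns
                            segment = Y[:, x:x + K, y:y + K]            # 3D tensor
                            Z[j, x, y] = np.sum(W[j] * segment) + b[j]  # tensor inner product
                Y = relu(Z)
            return softmax(Y.ravel())   # final softmax, standing in for the closing MLP

        # Toy example with made-up sizes: a 1-channel 8x8 "image", two conv layers.
        rng = np.random.default_rng(0)
        image = rng.standard_normal((1, 8, 8))
        weights = [rng.standard_normal((4, 1, 3, 3)) * 0.1,
                   rng.standard_normal((2, 4, 3, 3)) * 0.1]
        biases = [np.zeros(4), np.zeros(2)]
        print(cnn_forward(image, weights, biases).shape)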
    2. Why distribute
      1. Too many parameters otherwise.
      2. Without distributing:
        1. (K^2 + 1)N_1 weights in first layer.
        2. (N_1 + 1)N_2 weights in second layer.
        3. Total parameters O(K^2N_1 + N_1 N_2 +...).
      3. Distributing the representation over two layers (worked numbers after this list):
        1. Total parameters O(L_1^2N_1 + L_2^2N_2 + (\frac{K}{L_1L_2})^2N_2N_3 + ...).
      4. Rationale:
        1. Distribution forces lower layers to learn localized patterns, which generalizes better.
        2. Number of parameters: a large reduction, plus significant gains from shared computation.
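      A back-of-the-envelope check of the two counts above. The concrete numbers (K = 32, L1 = L2 = 4, N1 = 100, N2 = 50, N3 = 10) are made up for illustration, and the distributed count follows the O(.) expression in the notes literally, ignoring biases.

        K = 32                     # the full pattern is K x K pixels
        N1, N2, N3 = 100, 50, 10   # neurons per layer (made-up numbers)

        # Without distributing: an MLP that sees the whole K x K block at once.
        undistributed = (K**2 + 1) * N1 + (N1 + 1) * N2                          # 107550

        # Distributing over two scanning layers with small L1 x L1 and L2 x L2 blocks,
        # following the O(.) expression above (biases ignored).
        L1, L2 = 4, 4
        distributed = L1**2 * N1 + L2**2 * N2 + (K // (L1 * L2))**2 * N2 * N3    # 4400

        print(undistributed, distributed)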
    3. Terminology
      1. The first-layer scanning neurons (their shared weight blocks) – filters.
      2. The region of the input a neuron responds to – its receptive field.
    4. Modifications
      1. Shifting by more than one pixel – stride.
      2. Accounting for jitter – max pooling (a small numpy sketch follows this list).
        1. Each unit detects a component of the pattern, e.g. petals for flowers.
        2. Only meaningful when the stride is > 1; it down-samples the map.
      3. Higher layers: in reality we can have many layers of convolution, each followed by max pooling, before the final MLP. This is the convolutional neural network (CNN).
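      A minimal numpy sketch of max pooling with a stride (the 2x2 window and stride of 2 are illustrative defaults, not taken from the notes): taking the max inside each window tolerates small jitter of a detected component, and a stride greater than 1 down-samples the map.

        import numpy as np

        def max_pool(Y, size=2, stride=2):
            # Max pooling over a (channels, height, width) map.
            D, H, W = Y.shape
            H_out = (H - size) // stride + 1
            W_out = (W - size) // stride + 1
            out = np.empty((D, H_out, W_out))
            for x in range(H_out):
                for y in range(W_out):
                    patch = Y[:, x * stride:x * stride + size, y * stride:y * stride + size]
                    out[:, x, y] = patch.max(axis=(1, 2))   # tolerate jitter inside each window
            return out

        Y = np.arange(2 * 6 * 6, dtype=float).reshape(2, 6, 6)
        print(max_pool(Y).shape)    # (2, 3, 3): down-sampled by the stride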
