- Background
- Problem: Position independent pattern classification.
- The need for shift invariance.
- Solution:
- Scan: multiple identical MLPs at different locations; a max / perceptron / MLP combines the outputs of all of them.
- 1-D scanning for sound
- 2-D scanning for images
- 3-D and higher-dimensional scans for higher dimensional data
- Scanning is equivalent to composing a large network with repeating subnets.
- The large network has shared subnets.
- A gigantic shared parameter network. Training with shared parameters.
- Backpropagation rules must be modified to combine (sum) the gradients from parameters that share the same value (see the scan sketch after this list).
- Reorder the computation: by layers -> by positions in the 1-D/2-D data -> by neurons.
- Distributing the scan.
- First layer looks for smaller patterns.
- Second layer looks for larger patterns. A larger pattern consists of a grid of blocks.
- But it is still a shared-parameter network. The first layer is doubly shared (duplicated x duplicated).
- Scanning for a pattern. The operation is called convolution.
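A minimal sketch of the 2-D scan with shared parameters, collapsing the identical MLPs into a single shared perceptron for brevity; the function name scan_and_max and all sizes are illustrative, not from the notes. Because the same W and b are reused at every location, backpropagation sums the gradients from all windows onto them.

```python
import numpy as np

def scan_and_max(image, W, b, K):
    """Scan one shared perceptron (weights W, bias b) over every K x K window
    of a 2-D image and take the max over all window outputs.
    The same W and b are reused at every location (shared parameters), so the
    gradients from every window accumulate onto the same W and b in backprop."""
    H, W_img = image.shape
    outputs = np.empty((H - K + 1, W_img - K + 1))
    for x in range(H - K + 1):
        for y in range(W_img - K + 1):
            window = image[x:x + K, y:y + K]                 # K x K patch
            outputs[x, y] = np.tanh(np.sum(W * window) + b)  # shared perceptron
    return outputs.max(), outputs  # max over positions -> position-independent score

# Toy usage with random values (purely illustrative)
rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))
W = rng.standard_normal((4, 4))
score, activation_map = scan_and_max(img, W, b=0.0, K=4)
```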
- Convolutional neural network
- Vector notation: the weight W(l, j) is now a 3D tensor of size D(l-1) x K(l) x K(l).
- Notation:
- l is the index of the current layer.
- j is the index of the neuron in the current layer.
- D(l) is the number of neurons in layer l, so D(l-1) is the number of neurons in the previous layer.
- K(l) is the size of the block (the width/height of the region each neuron scans at layer l).
- Pseudocode
- Y(0) = Image
- for l = 1:L # layers operate on the vector at each position (x, y)
- for j = 1:D(l) # neurons in layer l
- for x = 1:width(l-1)-K(l)+1 # every position where a K(l) x K(l) block fits in Y(l-1)
- for y = 1:height(l-1)-K(l)+1
- segment = Y(l-1, :, x:x+K(l)-1, y:y+K(l)-1) # 3D tensor
- z(l, j, x, y) = W(l, j) . segment # tensor inner product
- Y(l, j, x, y) = activation(z(l, j, x, y))
- Y = Softmax(Y(L))
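A runnable NumPy translation of one layer of the pseudocode above, assuming square K x K blocks, a stride of 1, and ReLU as the activation; the name conv_layer and the toy shapes are illustrative.

```python
import numpy as np

def conv_layer(Y_prev, W, activation=lambda z: np.maximum(z, 0.0)):
    """One convolutional layer of the pseudocode above.
    Y_prev : (D_prev, H, W_img)  activations of the previous layer
    W      : (D, D_prev, K, K)   one 3-D weight tensor per neuron j
    Returns a (D, H-K+1, W_img-K+1) array of activations."""
    D, D_prev, K, _ = W.shape
    _, H, W_img = Y_prev.shape
    Y = np.empty((D, H - K + 1, W_img - K + 1))
    for j in range(D):                         # neurons in this layer
        for x in range(H - K + 1):             # every valid vertical position
            for y in range(W_img - K + 1):     # every valid horizontal position
                segment = Y_prev[:, x:x + K, y:y + K]  # 3-D tensor
                z = np.sum(W[j] * segment)             # tensor inner product
                Y[j, x, y] = activation(z)
    return Y

# Toy usage: a 1-channel 6x6 "image", 2 neurons with 3x3 blocks
rng = np.random.default_rng(0)
Y0 = rng.standard_normal((1, 6, 6))
W1 = rng.standard_normal((2, 1, 3, 3))
Y1 = conv_layer(Y0, W1)   # shape (2, 4, 4)
```

Stacking calls to conv_layer and finishing with a softmax over the last layer's outputs reproduces the full loop over l = 1:L.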
- Why distribute
- Too many parameters otherwise.
- Without distributing: every first-layer neuron must span the entire pattern, so the first layer alone needs (pattern size) x (number of first-layer neurons) weights, and the second layer adds (first-layer neurons) x (second-layer neurons) weights on top.
- Total parameters: dominated by the huge first layer.
- With the representation distributed over 2 layers: first-layer neurons span only small blocks, and second-layer neurons combine the grid of block outputs.
- Total parameters: far fewer (a worked example follows below).
- Rationale:
- Distribution forces the lower layers to learn localized patterns, which generalizes better.
- Number of parameters: a large reduction in parameters, and significant gains from shared computation.
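The concrete parameter counts from the notes did not survive extraction above, so the numbers below are purely hypothetical, chosen only to make the comparison concrete: a non-distributed scan whose first-layer neurons each see the whole P x P pattern, versus a distributed version whose first-layer neurons each see only a K x K block.

```python
# Hypothetical sizes, chosen only for illustration.
P = 32      # width/height of the full pattern the network must detect
K = 8       # width/height of the small block a first-layer neuron sees
D1 = 16     # neurons in the first layer
D2 = 4      # neurons in the second layer

# Without distributing: every first-layer neuron spans the full P x P pattern.
flat_first = D1 * P * P             # 16 * 1024 = 16384 weights
flat_total = flat_first + D2 * D1   # + second layer

# Distributed over 2 layers: first-layer neurons span only K x K blocks,
# second-layer neurons combine the (P/K) x (P/K) grid of block outputs.
dist_first  = D1 * K * K                       # 16 * 64 = 1024 weights
dist_second = D2 * D1 * (P // K) * (P // K)    # 4 * 16 * 16 = 1024 weights
dist_total  = dist_first + dist_second

print(flat_total, dist_total)   # 16448 vs 2048: a large reduction
```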
- Terminology
- First-layer neuron – filter
- The region of the input it looks at – Receptive field
- Modifications
- Shifting by more than one pixel at a time – Stride
- Accounting for jitter in the pattern's position – Max Pooling
- Detects a component of the pattern, e.g., the petals of a flower.
- Only meaningful when the stride > 1, since that is what downsamples the maps (a sketch of strided max pooling follows this list).
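A minimal sketch of max pooling with a stride, assuming a square pooling window; the window max absorbs small jitter in where a feature fires, and a stride greater than 1 is what downsamples the maps.

```python
import numpy as np

def max_pool(Y, K=2, stride=2):
    """Max pooling over K x K windows, moved `stride` pixels at a time.
    Y : (D, H, W) activation maps. Returns the downsampled maps."""
    D, H, W = Y.shape
    H_out = (H - K) // stride + 1
    W_out = (W - K) // stride + 1
    out = np.empty((D, H_out, W_out))
    for x in range(H_out):
        for y in range(W_out):
            window = Y[:, x * stride:x * stride + K, y * stride:y * stride + K]
            out[:, x, y] = window.max(axis=(1, 2))   # max over the window, per map
    return out

# 4x4 maps pooled with K=2, stride=2 -> 2x2 maps
Y = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
print(max_pool(Y).shape)   # (2, 2, 2)
```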
- Higher layers: in reality we can have many layers of convolution, each followed by max pooling, before the final MLP. This stack is the convolutional neural network (CNN); a hypothetical layer list is sketched below.
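A hypothetical layer stack, just to make the convolution -> max pooling -> ... -> final MLP structure concrete; every size below is made up for illustration and is not from the notes.

```python
# Hypothetical CNN: convolution + max-pooling stages, then a small MLP with softmax.
cnn_architecture = [
    {"type": "conv",     "neurons": 16, "block": 5, "stride": 1, "activation": "relu"},
    {"type": "max_pool", "block": 2,    "stride": 2},
    {"type": "conv",     "neurons": 32, "block": 3, "stride": 1, "activation": "relu"},
    {"type": "max_pool", "block": 2,    "stride": 2},
    {"type": "flatten"},                                           # hand maps to the MLP
    {"type": "dense",    "neurons": 64, "activation": "relu"},     # final MLP
    {"type": "dense",    "neurons": 10, "activation": "softmax"},  # class probabilities
]
```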