
  1. Recap
    1. Convolutional Neural Network
      1. Giant MLP with shared parameters.
      2. Stack of convolution layers, pooling layers, and an MLP.
      3. Filters are perceptrons with weights and biases.
    2. Training CNN
      1. Similar to regular MLP.
        1. Provide examples.
        2. Define divergence.
          1. Loss = \frac{1}{T} \sum_{i=1}^T div(Y_i, d_i).
        3. Gradient descent.
          1. Initialize all weights and biases
          2. For every layer l and every filter index m, update:
            1. w(l, m, j, x, y) = w(l, m, j, x, y) - \eta \frac{dLoss}{dw(l, m, j, x, y)}
          3. Until the error has converged (a minimal sketch of this loop follows the list).
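A minimal sketch of this training loop in numpy-flavored Python. The `model` object and its methods (`forward`, `divergence`, `loss_grad`, `backward`, `weights`) are hypothetical names introduced only to make the update rule above concrete; they are not from the lecture.

```python
import numpy as np

def train(model, X, D, eta=0.01, tol=1e-4, max_epochs=100):
    """Batch gradient descent over all training pairs (x, d).

    `model` is a hypothetical wrapper: forward(x) -> Y, divergence(Y, d) -> scalar,
    loss_grad(Y, d) -> dDiv/dY, backward(dDiv/dY) -> {key: dDiv/dw},
    weights -> {key: ndarray of w(l, m, j, x, y)}.
    """
    prev_loss = float("inf")
    for _ in range(max_epochs):
        loss = 0.0
        grads = {k: np.zeros_like(w) for k, w in model.weights.items()}
        for x, d in zip(X, D):
            Y = model.forward(x)
            loss += model.divergence(Y, d)
            for k, g in model.backward(model.loss_grad(Y, d)).items():
                grads[k] += g
        loss /= len(X)                      # Loss = (1/T) sum_i div(Y_i, d_i)
        for k in model.weights:             # w <- w - eta * dLoss/dw
            model.weights[k] -= eta * grads[k] / len(X)
        if abs(prev_loss - loss) < tol:     # until the error has converged
            break
        prev_loss = loss
    return model
```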
  2. Backpropagation
    1. Convolutional layer
      1. For every filter of the l^{th} layer, each position in a map of the (l-1)^{th} layer affects several positions in that filter's output map.
      2. Forward computation: Y(l - 1) \overset{filter}{\rightarrow} Z(l) \overset{activation}{\rightarrow} Y(l)
      3. Backpropagation: \nabla_{Y(l - 1)}Div \leftarrow \nabla_{Z(l)}Div \leftarrow \nabla_{Y(l)}Div.
        1. \frac{dDiv}{dz(l,m,x,y)} = \frac{dDiv}{dY(l,m,x,y)}f'(z(l,m,x,y))
        2. Each Y(l-1,m,x,y) affects several z(l,n,x',y') terms.
          1. \frac{dDiv}{dY(l-1,m,x,y)} = \sum_n \sum_{x'y'} \frac{dDiv}{dz(l,n,x',y')} \frac{dz(l,n,x',y')}{dY(l-1,m,x,y)} = \sum_n \sum_{x'y'} \frac{dDiv}{dz(l,n,x',y')} w_l(m, n, x - x', y - y'). Assuming indexing is from 0.
          2. It is obtained by flipping the filter left-right and top-bottom and computing the inner product with the square patch of \frac{\partial Div}{\partial z} ending at (x, y); \frac{\partial Div}{\partial z} must be zero-padded where the patch extends past its boundary.
        3. Each w_l(m,n,x,y) also affects several z(l,n,x',y') terms.
          1. It affects terms in only one z map (the n^{th}).
          2. All entries in that map contribute to the derivative of the divergence w.r.t. w_l(m,n,x,y).
          3. \frac{dDiv}{dw_l(m,n,x,y)} = \sum_{x'y'} \frac{dDiv}{dz(l,n,x',y')} \frac{dz(l,n,x',y')}{dw_l(m,n,x,y)} = \sum_{x'y'} \frac{dDiv}{dz(l,n,x',y')} Y(l-1, m, x + x', y + y') (see the sketch below).
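A direct (unvectorized) numpy sketch of the two gradients derived above, assuming stride 1, no padding, maps stored as (channels, height, width), and filters w_l stored as an (M, N, K, K) array; the loops mirror the sums rather than the filter-flipping view.

```python
import numpy as np

def conv_backward(dz, Y_prev, W):
    """Backprop through one convolutional layer (stride 1, no padding),
    looping directly over the sums derived above.

    dz     : (N, Hout, Wout)  dDiv/dz(l, n, x', y')
    Y_prev : (M, Hin, Win)    Y(l-1, m, x, y)
    W      : (M, N, K, K)     w_l(m, n, x, y)
    Returns dDiv/dY(l-1) and dDiv/dw_l.
    """
    M, Hin, Win = Y_prev.shape
    N, Hout, Wout = dz.shape
    K = W.shape[-1]
    dY_prev = np.zeros_like(Y_prev)
    dW = np.zeros_like(W)
    for n in range(N):
        for xp in range(Hout):
            for yp in range(Wout):
                g = dz[n, xp, yp]
                for m in range(M):
                    # dDiv/dw_l(m,n,x,y) += dDiv/dz(l,n,x',y') * Y(l-1, m, x+x', y+y')
                    dW[m, n] += g * Y_prev[m, xp:xp + K, yp:yp + K]
                    # dDiv/dY(l-1,m,x,y) += dDiv/dz(l,n,x',y') * w_l(m, n, x-x', y-y')
                    dY_prev[m, xp:xp + K, yp:yp + K] += g * W[m, n]
    return dY_prev, dW
```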
    2. Pooling and downsampling
      1. Derivative of max pooling
        1. \frac{dDiv}{dY(l,m,k,n)} = \frac{dDiv}{dU(l,m,i,j)} \text{ if } (k, n) = P(l,m,i,j) \text{ else } 0, where P(l,m,i,j) is the position of the maximum within pool (i, j).
      2. Derivative of mean pooling
        1. \frac{dDiv}{dY(l,m,k,n)} = \frac{1}{K_{l,pool}^2} \frac{dDiv}{dU(l,m,i,j)}, where (k, n) lies in the window pooled into (i, j) (see the pooling sketch below).
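A sketch of both pooling gradients, assuming non-overlapping K x K windows and that the forward pass stored the argmax coordinates P(l, m, i, j); the variable names are mine, not the lecture's.

```python
import numpy as np

def maxpool_backward(dU, P, in_shape):
    """dDiv/dU flows only to the stored argmax positions (k, n) = P(l, m, i, j);
    every other position of Y gets zero gradient."""
    dY = np.zeros(in_shape)                       # (M, H, W)
    M, Hp, Wp = dU.shape
    for m in range(M):
        for i in range(Hp):
            for j in range(Wp):
                k, n = P[m, i, j]                 # argmax saved during the forward pass
                dY[m, k, n] += dU[m, i, j]
    return dY

def meanpool_backward(dU, K, in_shape):
    """Each input inside a K x K pool receives 1/K^2 of the pooled gradient."""
    dY = np.zeros(in_shape)
    M, Hp, Wp = dU.shape
    for m in range(M):
        for i in range(Hp):
            for j in range(Wp):
                dY[m, i * K:(i + 1) * K, j * K:(j + 1) * K] += dU[m, i, j] / K ** 2
    return dY
```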
  3. Transposed Convolution
    1. Subsequent maps can increase in size.
      1. Add a layer of increased size.
      2. However, symmetry must be maintained.
      3. Each neuron has the same number of outgoing weights.
    2. In shrinking layers, each neuron has the same number of incoming weights.
    3. In expanding layers, each neuron has the same number of outgoing weights.
    4. 2D Expanding convolution:
      1. z(1, i, j) = \sum_m \sum_k \sum_l w(1, m, i - kb, j - lb) I(m, k, l), where I is the input map.
      2. b is the stride.
      3. Output size is typically an integer multiple of the input size (see the sketch below).
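A numpy sketch of this expanding convolution for a single output map, looping directly over the formula above; the array shapes and names are my assumptions.

```python
import numpy as np

def expanding_conv2d(I, W, b):
    """Transposed ('expanding') convolution for a single output map,
    following z(1, i, j) = sum_m sum_k sum_l w(1, m, i - kb, j - lb) I(m, k, l).

    I : (M, Hin, Win)  input maps I(m, k, l)
    W : (M, K, K)      the filter w(1, m, ., .)
    b : int            stride; each input position paints a K x K copy of the
                       filter onto the output, b apart from its neighbours.
    """
    M, Hin, Win = I.shape
    K = W.shape[-1]
    z = np.zeros(((Hin - 1) * b + K, (Win - 1) * b + K))
    for m in range(M):
        for k in range(Hin):
            for l in range(Win):
                # z(i, j) += w(m, i - k*b, j - l*b) * I(m, k, l)
                z[k * b:k * b + K, l * b:l * b + K] += W[m] * I[m, k, l]
    return z
```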
  4. Transform Invariance
    1. Problem: currently the CNN provides shift invariance, but rotation, scale, and reflection invariance are also required.
    2. Solution: enumerate the transforms and apply each filter in every transformed version (rotated, scaled, reflected), producing one map per transformed filter (see the filter-bank sketch below).
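One way to read the "enumerated transforms" idea, as a sketch only: build rotated and reflected copies of a single filter and pool the responses across them. Scale invariance would need resampled filters, which is omitted here; `scipy.signal.correlate2d` is used purely for brevity.

```python
import numpy as np
from scipy.signal import correlate2d

def transformed_filter_bank(w):
    """Enumerate transformed copies of one square 2D filter:
    four 90-degree rotations plus their left-right reflections."""
    rots = [np.rot90(w, r) for r in range(4)]
    return rots + [np.fliplr(r) for r in rots]

def invariant_response(x, w):
    """Correlate the input with every transformed copy of the filter and
    max-pool across the transformed responses, so a pattern is detected
    regardless of which of the enumerated transforms it appears in."""
    maps = [correlate2d(x, wt, mode="valid") for wt in transformed_filter_bank(w)]
    return np.max(np.stack(maps), axis=0)
```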
  5. Other Model Variations
    1. Bounding box estimation
    2. Pose estimation
    3. Very deep networks
      1. ResNet
    4. Depth-wise convolutions
      1. Each input channel is convolved with its own 2D filter, and the resulting maps are added up to form each output map, reducing the number of parameters (see the sketch below).
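A sketch of the depth-wise idea under the common "depthwise separable" reading, which is my assumption about what the note intends: one spatial filter per input channel, then each output map is a weighted sum of the per-channel maps.

```python
import numpy as np
from scipy.signal import correlate2d

def depthwise_separable_conv(X, W_depth, W_point):
    """Depth-wise step: convolve each input channel with its own K x K filter.
    Point-wise step: each output map is a weighted sum of the per-channel maps.

    X       : (M, H, W)   input channels
    W_depth : (M, K, K)   one spatial filter per input channel
    W_point : (N, M)      mixing weights that add the maps up into N outputs
    """
    per_channel = np.stack([correlate2d(X[m], W_depth[m], mode="valid")
                            for m in range(X.shape[0])])          # (M, H', W')
    return np.tensordot(W_point, per_channel, axes=([1], [0]))    # (N, H', W')
```

For M input channels, N output maps, and K x K filters this uses M*K^2 + N*M weights instead of the N*M*K^2 of a standard convolution, which is the parameter reduction referred to above.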
