
  1. Recap
    1. Perceptron: an affine combination of the inputs (a NumPy sketch follows this recap).
      z = \sum_i w_i x_i - T
    2. Activation functions:
      1. Binary Step: y = 1 if z \ge 0 else 0
      2. Sigmoid: y = \frac{1}{1 + \exp(-z)}
      3. Tanh: y = tanh(z).
    3. The multi-layer perceptron
      1. Input layer, Hidden layer, Output layer
      2. Deep network: Has at least 3 layers
        Depth of DAG is the length of the longest path from a source to a sink
      3. MLPs approximate functions
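
A minimal NumPy sketch of the recap above: the affine combination z = \sum_i w_i x_i - T followed by each of the three activations. The weights, threshold, and input values are made up purely for illustration.

```python
import numpy as np

# Affine combination of the inputs: z = sum_i(w_i * x_i) - T
def affine(x, w, T):
    return np.dot(w, x) - T

# The three activation functions from the recap
def binary_step(z):
    return np.where(z >= 0, 1.0, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

# A single perceptron with made-up input, weights, and threshold
x = np.array([0.5, -1.0, 2.0])
w = np.array([1.0, 0.3, -0.7])
T = 0.2
z = affine(x, w, T)
print(binary_step(z), sigmoid(z), tanh(z))
```
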
  2. MLPs as Boolean Functions
    1. A Boolean function can be fully specified by a truth table.
    2. Disjunctive Normal Form (DNF)
      1. An MLP with one hidden layer is a universal Boolean function (sketched below).
      2. However, the hidden layer may need to be exponentially large.
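
A sketch of why one hidden layer is universal for Boolean functions, using the threshold-unit convention step(\sum_i w_i x_i - T): one hidden unit per satisfying row of the truth table (one per DNF minterm), OR-ed by the output unit. The 3-input majority function used as the example is an arbitrary choice.

```python
import numpy as np
from itertools import product

def step(z):
    return (z >= 0).astype(float)

def dnf_mlp(truth_table, n):
    """One hidden layer realising an arbitrary n-input truth table.

    One hidden unit per input pattern mapped to 1 (one unit per DNF minterm);
    the output unit is an OR over those units."""
    minterms = [p for p in product([0, 1], repeat=n) if truth_table[p] == 1]

    def net(x):
        x = np.asarray(x, dtype=float)
        hidden = np.array([
            step(np.dot([1.0 if b else -1.0 for b in p], x) - (sum(p) - 0.5))
            for p in minterms])                      # each unit fires iff x == p
        return step(hidden.sum() - 0.5)              # OR of all minterm units

    return net

# Example: 3-input majority function as a truth table
n = 3
table = {p: int(sum(p) >= 2) for p in product([0, 1], repeat=n)}
net = dnf_mlp(table, n)
assert all(net(p) == table[p] for p in product([0, 1], repeat=n))
print("hidden units used:", sum(table.values()))
```

The hidden-layer size equals the number of satisfying rows of the truth table, which in the worst case is exponential in the number of inputs.
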
    3. Karnaugh Map
      1. Adjacent boxes can be grouped to reduce the complexity of the DNF formula.
    4. The need for Depth
      1. Worst case: the largest irreducible DNF, i.e. N-input parity (XOR), whose Karnaugh map is a checkerboard pattern.
      2. Single hidden layer: for N inputs, requires 2^{N-1} perceptrons in the hidden layer.
      3. Multi-layer: for N inputs, requires 3(N-1) perceptrons (see the sketch after this section).
        With a tree structure, only log_2(N) layers are required.
        If the depth is limited, the number of perceptrons becomes exponential again.
      4. If the Boolean function can be expressed as some preliminary computation followed by XORs, the same problem arises.
      5. Any Boolean circuit of depth d that computes N-input parity using AND, OR and NOT gates with unbounded fan-in must have size 2^{N^{1/d}}. (Strictly speaking, an MLP is a threshold circuit, not a Boolean circuit.)
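
A sketch of the deep alternative for the worst case above: N-input parity built from a chain of XOR gates, each XOR made of 3 threshold perceptrons, so 3(N-1) units in total (a balanced tree arrangement of the same gates gives the logarithmic depth mentioned in 2.4.3). The one-hidden-layer alternative would need one minterm unit per odd-parity pattern, i.e. 2^{N-1} units, exactly as in the DNF sketch above. All weights and thresholds are hand-picked for illustration.

```python
import numpy as np
from itertools import product

def step(z):
    return (z >= 0).astype(float)

def perceptron(x, w, T):
    # a single threshold unit: step(sum_i w_i * x_i - T)
    return step(np.dot(w, x) - T)

def xor_gate(a, b):
    # XOR built from 3 perceptrons arranged in 2 layers
    h_or   = perceptron(np.array([a, b]), np.array([ 1.0,  1.0]),  0.5)  # a OR b
    h_nand = perceptron(np.array([a, b]), np.array([-1.0, -1.0]), -1.5)  # NOT(a AND b)
    return perceptron(np.array([h_or, h_nand]), np.array([1.0, 1.0]), 1.5)  # AND of the two

def deep_parity(bits):
    # chain (or tree) of N-1 XOR gates -> 3(N-1) perceptrons for N inputs
    acc = float(bits[0])
    for b in bits[1:]:
        acc = xor_gate(acc, float(b))
    return acc

N = 6
for bits in product([0, 1], repeat=N):
    assert deep_parity(bits) == sum(bits) % 2
print(f"parity of {N} inputs computed with only {3 * (N - 1)} perceptrons")
```
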
  3. MLPs as Universal Classifiers
    1. Complex decision boundaries.
      1. Example: Two hidden layers.
        The first hidden layer defines half-planes, the second ANDs them into convex polygons, and the output ORs the polygons to form the decision boundary.
      2. Universal: One hidden layer.
        1. Approximate the region with polygons of N edges; each edge is one hidden perceptron.
        2. Inside a polygon, all N hidden units fire (sum = N); over most of the area outside, only about N/2 fire. Thresholding the sum above N/2 therefore separates inside from outside.
        3. As N -> infinity, the polygons become circles (viewing the summed response as a third dimension, they become cylinders).
        4. OR many such circles/cylinders together to fill the target region (sketched below).
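
A sketch of the one-hidden-layer construction in 3.1.2, taken to the N -> infinity limit: N threshold units whose decision boundaries are tangent lines of a circle. Their sum is N everywhere inside the circle and falls toward N/2 far outside, so a threshold between N/2 and N carves out an approximate cylinder; many such groups, OR-ed or summed at the output, can then fill an arbitrary region. The circle centre, radius, and test points below are arbitrary.

```python
import numpy as np

def step(z):
    return (z >= 0).astype(float)

def cylinder_response(points, center, radius, N=1000):
    """Sum of N half-plane units whose boundary lines are tangent to a circle.

    Each unit fires on the side of its tangent line containing the centre,
    so inside the circle all N units fire; far outside, only about half do."""
    angles = 2 * np.pi * np.arange(N) / N
    normals = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (N, 2)
    # unit k fires when  normal_k . (p - center) <= radius
    z = radius - (points - center) @ normals.T                    # (num_points, N)
    return step(z).sum(axis=1)

center, radius, N = np.array([0.0, 0.0]), 1.0, 1000
pts = np.array([[0.0, 0.0],    # centre of the circle
                [0.9, 0.0],    # inside, near the boundary
                [5.0, 0.0],    # outside
                [50.0, 0.0]])  # far outside
print(cylinder_response(pts, center, radius, N) / N)
# -> roughly [1.0, 1.0, 0.56, 0.51]: the sum is N inside and tends to N/2
#    far away, so thresholding it between N/2 and N yields the cylinder.
```
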
    2. Optimal depth
      1. View this problem as arithmetic circuits.
        1. Valiant: A polynomial of degree n requires a network of depth log^2(n); it cannot be shallower.
        2. Otherwise, the network becomes exponentially large.
      2. Example: the chessboard.
        1. One hidden layer: requires an exponential number of neurons to approximate the boundary.
        2. Two hidden layers: for an 8 x 8 chessboard, 16 neurons in hidden layer 1 (to represent the 16 lines) and 40 in hidden layer 2 (to represent the 40 yellow boxes).
        3. Multiple layers of XOR network: depending on the dimension of the input, the depth can be higher, but far fewer neurons are needed (see the sketch after this section).
        4. Overall: the number of neurons required in a shallow network is potentially exponential in the dimension of the input (in the worst case, when the input variables are statistically independent).
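
To make the chessboard example concrete, here is a sketch of the "preliminary computation followed by XORs" view from 2.4.4: a first layer of threshold units detects which grid lines a point has crossed, and the class is the parity (XOR) of those detector outputs. The parity stage is computed arithmetically here for brevity; in a pure MLP it would be the 3-perceptrons-per-XOR network from the sketch in section 2. The 8 x 8 board on the unit square is an illustrative choice.

```python
import numpy as np

def step(z):
    return (z >= 0).astype(int)

def chessboard_class(points, k=8):
    """Checkerboard labels on [0, 1)^2 split into a k x k grid.

    Layer 1: 2*(k-1) threshold units, one per internal grid line, each
    answering "has this coordinate crossed line i?".
    Remaining layers (collapsed into a mod-2 here): the XOR / parity of the
    line-detector outputs, which flips every time a line is crossed."""
    lines = np.arange(1, k) / k                   # internal grid lines
    x_detectors = step(points[:, [0]] - lines)    # (num_points, k-1)
    y_detectors = step(points[:, [1]] - lines)    # (num_points, k-1)
    crossings = x_detectors.sum(1) + y_detectors.sum(1)
    return crossings % 2                          # XOR of all detectors

rng = np.random.default_rng(0)
pts = rng.random((5, 2))
print(pts)
print(chessboard_class(pts))   # 0/1 checkerboard colour of each point
```
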
  4. MLPs as Universal Approximators of Continuous Functions
    1. How to approximate
      1. Similar to 3.1.2: use N-dimensional cylinders to approximate the function, scaling each cylinder by the local function value and summing (sketched below in one dimension).
      2. The network is actually a universal map from the entire domain of input values to the entire range of the output activation.
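
A one-dimensional sketch of 4.1: each pair of threshold units selects a small interval (the 1-D analogue of a cylinder), the pair's output weight is the function value on that interval, and summing all pairs gives a piecewise-constant approximation that improves as the intervals shrink. The target function sin and the number of intervals are arbitrary choices.

```python
import numpy as np

def step(z):
    return (z >= 0).astype(float)

def bump(x, a, b):
    # Indicator of [a, b): the difference of two threshold units
    return step(x - a) - step(x - b)

def approximate(f, x, lo=0.0, hi=2 * np.pi, n_bumps=100):
    """Piecewise-constant approximation of f built from n_bumps interval units."""
    edges = np.linspace(lo, hi, n_bumps + 1)
    centers = (edges[:-1] + edges[1:]) / 2
    y = np.zeros_like(x, dtype=float)
    for a, b, c in zip(edges[:-1], edges[1:], centers):
        y += f(c) * bump(x, a, b)     # output weight = local function value
    return y

x = np.linspace(0, 2 * np.pi, 1000)
err = np.max(np.abs(approximate(np.sin, x) - np.sin(x)))
print(f"max error with 100 bumps: {err:.3f}")   # shrinks as n_bumps grows
```
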
    2. Sufficiency of architecture
      1. Each hidden layer needs to be sufficiently wide to preserve the information. For example, 8 threshold neurons in the first hidden layer cannot preserve all the information in a 16-dimensional input.
      2. However, with a continuous-valued activation function (rather than a threshold function), the information may be preserved.
      3. In practice, something like a parametric rectified linear unit (PReLU) is more helpful, as it preserves information (it does not zero out negative pre-activations); see the illustration after this section.
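
A quick numerical illustration of 4.2, with made-up layer sizes and random weights: 8 threshold units can emit at most 2^8 = 256 distinct patterns, so pushing 1000 distinct 16-dimensional inputs through them necessarily collapses some of them together (information is lost); the same layer with a continuous activation (tanh here, standing in for PReLU) keeps the outputs distinct.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))        # 1000 distinct 16-dimensional inputs
W = rng.normal(size=(16, 8))           # a single hidden layer of width 8

def distinct_outputs(H):
    # count how many distinct hidden-layer outputs remain
    return len(np.unique(np.round(H, 12), axis=0))

Z = X @ W
print("threshold units:", distinct_outputs((Z >= 0).astype(float)), "distinct outputs")
print("tanh units     :", distinct_outputs(np.tanh(Z)), "distinct outputs")
# The threshold layer can emit at most 2**8 = 256 patterns, so many of the
# 1000 inputs collide (information lost); the continuous layer keeps them apart.
```
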
  5. Summary
    1. The MLP is a universal Boolean function, a universal classifier, and a universal approximator of continuous functions.
    2. But it can represent a given function only if it is 1) sufficiently wide and 2) sufficiently deep.
    3. Deeper networks require far fewer neurons than shallower networks to express the same function. (A single hidden layer may require an exponentially large number of perceptrons.)
