
    1. Background
      1. Application scenario: analysis of time-series data.
      2. Finite-response model: the output depends only on a finite window of past inputs, Y_t = f(X_t, X_{t-1}, ..., X_{t-N}).
      3. Infinite-response model: the output depends on the entire past.
        1. Y_t = f(X_t, Y_{t-1}).
          1. Requires an initial state: Y_{t-1} for t = 0, i.e. Y_{-1}.
          2. Y_0 produces Y_1, which produces Y_2, and so on (see the sketch below).
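
A minimal sketch of the infinite-response recursion above. The particular f used here (a leaky accumulator) and all constants are arbitrary assumptions for illustration; the point is only that each output depends on the entire past through the fed-back previous output, which is why the initial state Y_{-1} must be supplied.

```python
import numpy as np

def infinite_response(x, y_init, alpha=0.9):
    """Run the recursion y_t = f(x_t, y_{t-1}) over a sequence.

    f is a toy choice (leaky accumulator); y_init plays the role of Y_{-1}.
    """
    y_prev = y_init              # initial state Y_{-1}
    ys = []
    for x_t in x:
        y_t = alpha * y_prev + (1 - alpha) * x_t   # y_t = f(x_t, y_{t-1})
        ys.append(y_t)
        y_prev = y_t
    return np.array(ys)

x = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
print(infinite_response(x, y_init=0.0))   # the impulse never fully dies out
```
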
    2. Infinite-response model.
      1. NARX network:
        1. Definition: nonlinear autoregressive network with exogenous inputs. The output contains information about the entire past.
        2. General: uses several previous inputs and outputs.
        3. Complete: uses all previous inputs and outputs.
        4. Note: the memory of the past is completely stored in the output itself, not in the network.
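
A hedged sketch of the "general" NARX idea: the model sees the current input together with a few previous inputs and fed-back outputs, so the memory of the past lives entirely in the outputs themselves. The stand-in function toy_step, the window lengths, and the zero-padded histories are assumptions for illustration, not a specific published architecture (in practice the step function would be an MLP).

```python
import numpy as np
from collections import deque

def run_narx(x_seq, narx_step, n_in=2, n_out=2):
    """General NARX loop: y_t = F(x_t, ..., x_{t-n_in}, y_{t-1}, ..., y_{t-n_out}).

    narx_step is any callable mapping (recent inputs, recent outputs) -> y_t.
    """
    past_x = deque([0.0] * n_in, maxlen=n_in)    # previous inputs
    past_y = deque([0.0] * n_out, maxlen=n_out)  # previous (fed-back) outputs
    outputs = []
    for x_t in x_seq:
        y_t = narx_step(np.array([x_t, *past_x]), np.array(past_y))
        outputs.append(y_t)
        past_x.appendleft(x_t)
        past_y.appendleft(y_t)
    return np.array(outputs)

# Toy stand-in for the network: a fixed linear map of the histories.
toy_step = lambda xs, ys: 0.5 * xs.sum() + 0.3 * ys.sum()
print(run_narx(np.array([1.0, 0.0, 0.0, 0.0]), toy_step))
```
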
      2. Alternative
        1. Goal: put the memory into the network.
        2. Method: introduce a memory unit that stores information about the past (sketched after this list).
          1. Memory unit: m_t = r(y_{t-1}, h_{t-1}, m_{t-1}).
          2. Hidden value: h_t = f(x_t, m_t).
          3. Output: y_t = g(h_t).
        3. Jordan network: maintain a running average of outputs in a memory unit
        4. Elman network: store hidden unit values for one time instant in a context unit
        5. Both networks are only partially recurrent: during learning, the current error does not actually propagate back to past time steps.
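
A minimal sketch of the memory-unit formulation above, using a Jordan-style choice for r (a running average of past outputs). The dimensions, random weights, and decay rate beta are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, d_y = 3, 4, 2
Wxh = rng.normal(size=(d_h, d_x)) * 0.1   # input  -> hidden
Wmh = rng.normal(size=(d_h, d_y)) * 0.1   # memory -> hidden
Why = rng.normal(size=(d_y, d_h)) * 0.1   # hidden -> output

def jordan_forward(x_seq, beta=0.5):
    """m_t = r(y_{t-1}, m_{t-1}) as a running average of outputs (Jordan style);
    h_t = f(x_t, m_t); y_t = g(h_t)."""
    m = np.zeros(d_y)                      # memory unit
    ys = []
    for x_t in x_seq:
        h = np.tanh(Wxh @ x_t + Wmh @ m)   # h_t = f(x_t, m_t)
        y = Why @ h                        # y_t = g(h_t)
        m = beta * m + (1 - beta) * y      # running average of past outputs
        ys.append(y)
    return np.array(ys)

print(jordan_forward(rng.normal(size=(5, d_x))).shape)   # (5, 2)
```
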
    3. State-space model.
      1. h_t = f(x_t, h_{t-1}).
      2. y_t = g(h_t).
      3. h_t is the state of the network
      4. Need to define the initial state h_{-1}.
      5. This is a fully recurrent neural network. The state summarizes information about the entire past.
      6. Equations:
        1. h_i^{(1)}(-1): part of the network parameters (the learned initial state).
        2. h_i^{(1)}(t) = f_1(\sum_j w_{ji}^{(0)} X_j(t) + \sum_j w_{ji}^{(11)} h_j^{(1)}(t-1) + b_i^{(1)})
          1. State node activation function f_1() is typically tanh().
        3. Y_k(t) = f_2(\sum_j w_{jk}^{(1)} h_j^{(1)}(t) + b_k^{(1)}), for k = 1..M
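
A minimal numpy sketch of the state-space equations above. The dimensions, the initialization scale, a zero initial state, and a linear (identity) f_2 are assumptions; the weight names mirror the w^{(0)}, w^{(11)}, w^{(1)} notation used above.

```python
import numpy as np

rng = np.random.default_rng(1)
d_x, d_h, d_y = 3, 5, 2
W0  = rng.normal(size=(d_h, d_x)) * 0.1   # w^{(0)}:  input -> state
W11 = rng.normal(size=(d_h, d_h)) * 0.1   # w^{(11)}: state -> state (recurrent)
W1  = rng.normal(size=(d_y, d_h)) * 0.1   # w^{(1)}:  state -> output
b1  = np.zeros(d_h)
bk  = np.zeros(d_y)
h_init = np.zeros(d_h)                    # h(-1): part of the parameters

def forward(X):
    """X: (T, d_x). Returns hidden states H (T, d_h) and outputs Y (T, d_y)."""
    h = h_init
    H, Y = [], []
    for x_t in X:
        h = np.tanh(W0 @ x_t + W11 @ h + b1)   # h(t) = f_1(...), f_1 = tanh
        H.append(h)
        Y.append(W1 @ h + bk)                  # Y(t) = f_2(...), here identity
    return np.array(H), np.array(Y)

H, Y = forward(rng.normal(size=(6, d_x)))
print(H.shape, Y.shape)    # (6, 5) (6, 2)
```
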
      7. Variants
        1. One to one: conventional MLP
        2. One to many: sequence generation, e.g., image captioning.
        3. Many to one: sequence-based classification or prediction, e.g., speech recognition, text classification (see the many-to-one sketch after this list).
        4. Many to many, shifted: delayed sequence-to-sequence mapping, e.g., machine translation.
        5. Many to many, unshifted: per-step prediction, e.g., the stock-prediction problem.
        6. Summary
          1. Time-series problems must consider past inputs together with the current input.
          2. Looking into the infinite past requires recursion.
          3. NARX -> feeding back the output to the input.
          4. Simple recurrent networks maintain memory or context.
          5. State-space models retain information about the past through recurrent hidden states.
            1. This enables the current error to propagate back and update parameters through past time steps.
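
As a concrete instance of the many-to-one variant, a sketch of sequence classification that consumes the whole input sequence and reads a single prediction off the final hidden state. The sizes and the softmax readout are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d_x, d_h, n_classes = 3, 5, 4
Wxh = rng.normal(size=(d_h, d_x)) * 0.1
Whh = rng.normal(size=(d_h, d_h)) * 0.1
Why = rng.normal(size=(n_classes, d_h)) * 0.1

def classify_sequence(X):
    """Many to one: read the entire sequence, emit one class distribution."""
    h = np.zeros(d_h)
    for x_t in X:                      # recurrence over the whole input
        h = np.tanh(Wxh @ x_t + Whh @ h)
    logits = Why @ h                   # single output from the final state
    p = np.exp(logits - logits.max())  # softmax readout (assumption)
    return p / p.sum()

print(classify_sequence(rng.normal(size=(10, d_x))))
```
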
    4. Backpropagation through time (BPTT)
      1. Forward pass (figure).
      2. Backprop
        1. Compute \frac{dDiv}{dY_i(t)} for all i and all t \le T.
          1. If the divergence decomposes over time steps, this simplifies to \frac{dDiv}{dY_i(t)} = \frac{dDiv(t)}{dY_i(t)}.
        2. \frac{dDiv}{dZ_i^{(1)}(T)} = \frac{dDiv}{dY_i(T)} \frac{dY_i(T)}{dZ_i^{(1)}(T)}
        3. \frac{dDiv}{dh_i(T)} = \sum_jw_{ij}^{(1)} \frac{dDiv}{dZ_j^{(1)}(T)}
        4. \frac{dDiv}{dw_{ij}^{(1)}} = h_i(T)\frac{dDiv}{dZ_j^{(1)}(T)} (the contribution at time T; the full gradient sums such terms over all t).
        5. Backward pass (figure); a numpy sketch with a numerical-gradient check follows below.
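
A minimal numpy sketch of BPTT for the state-space model above, assuming a squared-error divergence summed over time and a linear output. It shows the two backward paths into the state (from the output at time t and from the state at time t+1 through w^{(11)}), the accumulation of the shared-weight gradients over all time steps, and a numerical check of one gradient entry.

```python
import numpy as np

rng = np.random.default_rng(3)
d_x, d_h, d_y, T = 2, 4, 1, 6
W0  = rng.normal(size=(d_h, d_x)) * 0.3   # input -> state
W11 = rng.normal(size=(d_h, d_h)) * 0.3   # state -> state
W1  = rng.normal(size=(d_y, d_h)) * 0.3   # state -> output
b1, b2 = np.zeros(d_h), np.zeros(d_y)
h_init = np.zeros(d_h)

X = rng.normal(size=(T, d_x))
D = rng.normal(size=(T, d_y))             # desired outputs

def forward():
    h, H, Y = h_init, [], []
    for x_t in X:
        h = np.tanh(W0 @ x_t + W11 @ h + b1)   # state update, f_1 = tanh
        H.append(h)
        Y.append(W1 @ h + b2)                  # linear output, f_2 = identity
    return np.array(H), np.array(Y)

def bptt():
    """BPTT for Div = 0.5 * sum_t ||Y(t) - D(t)||^2."""
    H, Y = forward()
    dW0, dW11, dW1 = np.zeros_like(W0), np.zeros_like(W11), np.zeros_like(W1)
    dh_next = np.zeros(d_h)                    # dDiv/dh(t) arriving from t+1
    for t in reversed(range(T)):
        dz2 = Y[t] - D[t]                      # dDiv/dZ at the output (identity f_2)
        dW1 += np.outer(dz2, H[t])
        dh  = W1.T @ dz2 + dh_next             # output path + recurrent path
        dz1 = dh * (1.0 - H[t] ** 2)           # through tanh
        h_prev = H[t - 1] if t > 0 else h_init
        dW0  += np.outer(dz1, X[t])
        dW11 += np.outer(dz1, h_prev)
        dh_next = W11.T @ dz1                  # pass the error one step further back
    return dW0, dW11, dW1

def div():
    _, Y = forward()
    return 0.5 * np.sum((Y - D) ** 2)

# Check one entry of dDiv/dW11 against a central-difference numerical gradient.
g_analytic = bptt()[1][0, 0]
eps = 1e-6
W11[0, 0] += eps;     d_plus = div()
W11[0, 0] -= 2 * eps; d_minus = div()
W11[0, 0] += eps
print(g_analytic, (d_plus - d_minus) / (2 * eps))   # should closely agree
```
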
    5. Bidirectional RNN (BRNN)
      1. An RNN with both a forward and a backward recursion. It explicitly models that the future can be predicted from the past and the past can be predicted from the future.
      2. Basic structure: one hidden layer connects forward in time, and one hidden layer connects backward in time. The two parts work independently.
      3. Implementation: implement the forward-direction recursion once, then reuse the same computation for the backward direction by flipping the input sequence in time (see the sketch below).
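
A sketch of that implementation note, assuming two independent sets of hidden weights: the same one-directional recursion is run once left-to-right and once on the time-flipped input, the flipped states are reversed back into time order, and the output layer reads both.

```python
import numpy as np

rng = np.random.default_rng(4)
d_x, d_h, d_y = 3, 4, 2
fwd = [rng.normal(size=s) * 0.1 for s in [(d_h, d_x), (d_h, d_h)]]   # forward-layer weights
bwd = [rng.normal(size=s) * 0.1 for s in [(d_h, d_x), (d_h, d_h)]]   # backward-layer weights
Why = rng.normal(size=(d_y, 2 * d_h)) * 0.1                          # reads both directions

def run_direction(X, Wxh, Whh):
    """Plain one-directional recursion; shared by both halves of the BRNN."""
    h, H = np.zeros(d_h), []
    for x_t in X:
        h = np.tanh(Wxh @ x_t + Whh @ h)
        H.append(h)
    return np.array(H)

def brnn(X):
    Hf = run_direction(X, *fwd)                  # left-to-right pass
    Hb = run_direction(X[::-1], *bwd)[::-1]      # reuse the same code on the flipped sequence
    return np.array([Why @ np.concatenate([hf, hb]) for hf, hb in zip(Hf, Hb)])

print(brnn(rng.normal(size=(7, d_x))).shape)     # (7, 2)
```
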
