
    1. Background
      1. Application scenario: analysis of time-series data.
      2. Finite-response model: the output depends only on a finite window of past inputs, Y_t = f(X_t, X_{t-1}, ..., X_{t-N}).
      3. Infinite-response model: the output depends on the entire past.
        1. Y_t = f(X_t, Y_{t-1}).
          1. Requires an initial state: Y_{t-1} for t = 0, i.e. Y_{-1}.
          2. Y_0 produces Y_1, which produces Y_2, and so on (see the sketch below).
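
A minimal sketch of the infinite-response recursion above. The particular f used here (a leaky accumulator) and all constants are arbitrary assumptions for illustration; the point is only that each output depends on the entire past through the fed-back previous output, which is why the initial state Y_{-1} must be supplied.

```python
import numpy as np

def infinite_response(x, y_init, alpha=0.9):
    """Run the recursion y_t = f(x_t, y_{t-1}) over a sequence.

    f is a toy choice (leaky accumulator); y_init plays the role of Y_{-1}.
    """
    y_prev = y_init              # initial state Y_{-1}
    ys = []
    for x_t in x:
        y_t = alpha * y_prev + (1 - alpha) * x_t   # y_t = f(x_t, y_{t-1})
        ys.append(y_t)
        y_prev = y_t
    return np.array(ys)

x = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
print(infinite_response(x, y_init=0.0))   # the impulse never fully dies out
```
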
    2. Infinite-response model.
      1. NARX network:
        1. Definition: nonlinear autoregressive network with exogenous inputs. The output contains information about the entire past.
        2. General: uses several previous inputs and outputs.
        3. Complete: uses all previous inputs and outputs.
        4. Note: the memory of the past is completely stored in the output itself, not in the network.
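
A hedged sketch of the "general" NARX idea: the model sees the current input together with a few previous inputs and fed-back outputs, so the memory of the past lives entirely in the outputs themselves. The stand-in function toy_step, the window lengths, and the zero-padded histories are assumptions for illustration, not a specific published architecture (in practice the step function would be an MLP).

```python
import numpy as np
from collections import deque

def run_narx(x_seq, narx_step, n_in=2, n_out=2):
    """General NARX loop: y_t = F(x_t, ..., x_{t-n_in}, y_{t-1}, ..., y_{t-n_out}).

    narx_step is any callable mapping (recent inputs, recent outputs) -> y_t.
    """
    past_x = deque([0.0] * n_in, maxlen=n_in)    # previous inputs
    past_y = deque([0.0] * n_out, maxlen=n_out)  # previous (fed-back) outputs
    outputs = []
    for x_t in x_seq:
        y_t = narx_step(np.array([x_t, *past_x]), np.array(past_y))
        outputs.append(y_t)
        past_x.appendleft(x_t)
        past_y.appendleft(y_t)
    return np.array(outputs)

# Toy stand-in for the network: a fixed linear map of the histories.
toy_step = lambda xs, ys: 0.5 * xs.sum() + 0.3 * ys.sum()
print(run_narx(np.array([1.0, 0.0, 0.0, 0.0]), toy_step))
```
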
      2. Alternative
        1. Goal: put the memory into the network.
        2. Method: introduce a memory unit that stores information about the past (sketched after this list).
          1. Memory unit: m_t = r(y_{t-1}, h_{t-1}, m_{t-1}).
          2. Hidden value: h_t = f(x_t, m_t).
          3. Output: y_t = g(h_t).
        3. Jordan network: maintain a running average of outputs in a memory unit
        4. Elman network: store hidden unit values for one time instant in a context unit
        5. Both networks are only partially recurrent: during learning, the current error does not actually propagate back to past time steps.
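
A minimal sketch of the memory-unit formulation above, using a Jordan-style choice for r (a running average of past outputs). The dimensions, random weights, and decay rate beta are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, d_y = 3, 4, 2
Wxh = rng.normal(size=(d_h, d_x)) * 0.1   # input  -> hidden
Wmh = rng.normal(size=(d_h, d_y)) * 0.1   # memory -> hidden
Why = rng.normal(size=(d_y, d_h)) * 0.1   # hidden -> output

def jordan_forward(x_seq, beta=0.5):
    """m_t = r(y_{t-1}, m_{t-1}) as a running average of outputs (Jordan style);
    h_t = f(x_t, m_t); y_t = g(h_t)."""
    m = np.zeros(d_y)                      # memory unit
    ys = []
    for x_t in x_seq:
        h = np.tanh(Wxh @ x_t + Wmh @ m)   # h_t = f(x_t, m_t)
        y = Why @ h                        # y_t = g(h_t)
        m = beta * m + (1 - beta) * y      # running average of past outputs
        ys.append(y)
    return np.array(ys)

print(jordan_forward(rng.normal(size=(5, d_x))).shape)   # (5, 2)
```
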
    3. State-space model.
      1. h_t = f(x_t, h_{t-1}).
      2. y_t = g(h_t).
      3. h_t is the state of the network
      4. Need to define the initial state h_{-1}.
      5. This is a fully recurrent neural network. The state summarizes information about the entire past.
      6. Equations:
        1. h_i^{(1)}(-1): part of the network parameters (the learned initial state).
        2. h_i^{(1)}(t) = f_1(\sum_j w_{ji}^{(0)} X_j(t) + \sum_j w_{ji}^{(11)} h_j^{(1)}(t-1) + b_i^{(1)})
          1. State node activation function f_1() is typically tanh().
        3. Y_k(t) = f_2(\sum_j w_{jk}^{(1)} h_j^{(1)}(t) + b_k^{(1)}), for k = 1..M
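
A minimal numpy sketch of the state-space equations above. The dimensions, the initialization scale, a zero initial state, and a linear (identity) f_2 are assumptions; the weight names mirror the w^{(0)}, w^{(11)}, w^{(1)} notation used above.

```python
import numpy as np

rng = np.random.default_rng(1)
d_x, d_h, d_y = 3, 5, 2
W0  = rng.normal(size=(d_h, d_x)) * 0.1   # w^{(0)}:  input -> state
W11 = rng.normal(size=(d_h, d_h)) * 0.1   # w^{(11)}: state -> state (recurrent)
W1  = rng.normal(size=(d_y, d_h)) * 0.1   # w^{(1)}:  state -> output
b1  = np.zeros(d_h)
bk  = np.zeros(d_y)
h_init = np.zeros(d_h)                    # h(-1): part of the parameters

def forward(X):
    """X: (T, d_x). Returns hidden states H (T, d_h) and outputs Y (T, d_y)."""
    h = h_init
    H, Y = [], []
    for x_t in X:
        h = np.tanh(W0 @ x_t + W11 @ h + b1)   # h(t) = f_1(...), f_1 = tanh
        H.append(h)
        Y.append(W1 @ h + bk)                  # Y(t) = f_2(...), here identity
    return np.array(H), np.array(Y)

H, Y = forward(rng.normal(size=(6, d_x)))
print(H.shape, Y.shape)    # (6, 5) (6, 2)
```
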
      7. Variants
        1. One to one: conventional MLP
        2. One to many: sequence generation, e.g., image captioning.
        3. Many to one: sequence-based classification or prediction, e.g., speech recognition, text classification (see the many-to-one sketch after this list).
        4. Many to many, shifted: delayed sequence-to-sequence mapping, e.g., machine translation.
        5. Many to many, unshifted: per-step prediction, e.g., the stock-prediction problem.
        6. Summary
          1. Time-series problems must consider past inputs together with the current input.
          2. Looking into the infinite past requires recursion.
          3. NARX -> feeding back the output to the input.
          4. Simple recurrent networks maintain memory or context.
          5. State-space models retain information about the past through recurrent hidden states.
            1. This enables the current error to propagate back and update parameters through past time steps.
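
As a concrete instance of the many-to-one variant, a sketch of sequence classification that consumes the whole input sequence and reads a single prediction off the final hidden state. The sizes and the softmax readout are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d_x, d_h, n_classes = 3, 5, 4
Wxh = rng.normal(size=(d_h, d_x)) * 0.1
Whh = rng.normal(size=(d_h, d_h)) * 0.1
Why = rng.normal(size=(n_classes, d_h)) * 0.1

def classify_sequence(X):
    """Many to one: read the entire sequence, emit one class distribution."""
    h = np.zeros(d_h)
    for x_t in X:                      # recurrence over the whole input
        h = np.tanh(Wxh @ x_t + Whh @ h)
    logits = Why @ h                   # single output from the final state
    p = np.exp(logits - logits.max())  # softmax readout (assumption)
    return p / p.sum()

print(classify_sequence(rng.normal(size=(10, d_x))))
```
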
    4. Backpropagation through time (BPTT)
      1. Forward pass (figure).
      2. Backprop
        1. Compute \frac{dDiv}{dY_i(t)} for all i and all t \le T.
          1. If the divergence decomposes over time steps, this simplifies to \frac{dDiv}{dY_i(t)} = \frac{dDiv(t)}{dY_i(t)}.
        2. \frac{dDiv}{dZ_i^{(1)}(T)} = \frac{dDiv}{dY_i(T)} \frac{dY_i(T)}{dZ_i^{(1)}(T)}
        3. \frac{dDiv}{dh_i(T)} = \sum_jw_{ij}^{(1)} \frac{dDiv}{dZ_j^{(1)}(T)}
        4. \frac{dDiv}{dw_{ij}^{(1)}} = h_i(T)\frac{dDiv}{dZ_j^{(1)}(T)} (the contribution at time T; the full gradient sums such terms over all t).
        5. Backward pass (figure); a numpy sketch with a numerical-gradient check follows below.
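
A minimal numpy sketch of BPTT for the state-space model above, assuming a squared-error divergence summed over time and a linear output. It shows the two backward paths into the state (from the output at time t and from the state at time t+1 through w^{(11)}), the accumulation of the shared-weight gradients over all time steps, and a numerical check of one gradient entry.

```python
import numpy as np

rng = np.random.default_rng(3)
d_x, d_h, d_y, T = 2, 4, 1, 6
W0  = rng.normal(size=(d_h, d_x)) * 0.3   # input -> state
W11 = rng.normal(size=(d_h, d_h)) * 0.3   # state -> state
W1  = rng.normal(size=(d_y, d_h)) * 0.3   # state -> output
b1, b2 = np.zeros(d_h), np.zeros(d_y)
h_init = np.zeros(d_h)

X = rng.normal(size=(T, d_x))
D = rng.normal(size=(T, d_y))             # desired outputs

def forward():
    h, H, Y = h_init, [], []
    for x_t in X:
        h = np.tanh(W0 @ x_t + W11 @ h + b1)   # state update, f_1 = tanh
        H.append(h)
        Y.append(W1 @ h + b2)                  # linear output, f_2 = identity
    return np.array(H), np.array(Y)

def bptt():
    """BPTT for Div = 0.5 * sum_t ||Y(t) - D(t)||^2."""
    H, Y = forward()
    dW0, dW11, dW1 = np.zeros_like(W0), np.zeros_like(W11), np.zeros_like(W1)
    dh_next = np.zeros(d_h)                    # dDiv/dh(t) arriving from t+1
    for t in reversed(range(T)):
        dz2 = Y[t] - D[t]                      # dDiv/dZ at the output (identity f_2)
        dW1 += np.outer(dz2, H[t])
        dh  = W1.T @ dz2 + dh_next             # output path + recurrent path
        dz1 = dh * (1.0 - H[t] ** 2)           # through tanh
        h_prev = H[t - 1] if t > 0 else h_init
        dW0  += np.outer(dz1, X[t])
        dW11 += np.outer(dz1, h_prev)
        dh_next = W11.T @ dz1                  # pass the error one step further back
    return dW0, dW11, dW1

def div():
    _, Y = forward()
    return 0.5 * np.sum((Y - D) ** 2)

# Check one entry of dDiv/dW11 against a central-difference numerical gradient.
g_analytic = bptt()[1][0, 0]
eps = 1e-6
W11[0, 0] += eps;     d_plus = div()
W11[0, 0] -= 2 * eps; d_minus = div()
W11[0, 0] += eps
print(g_analytic, (d_plus - d_minus) / (2 * eps))   # should closely agree
```
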
    5. Bidirectional RNN (BRNN)
      1. An RNN with both a forward and a backward recursion. It explicitly models that the future can be predicted from the past and the past can be predicted from the future.
      2. Basic structure: one hidden layer connects forward in time, and one hidden layer connects backward in time. The two parts work independently.
      3. Implementation: implement the forward-direction recursion once, then reuse the same computation for the backward direction by flipping the input sequence in time (see the sketch below).
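
A sketch of that implementation note, assuming two independent sets of hidden weights: the same one-directional recursion is run once left-to-right and once on the time-flipped input, the flipped states are reversed back into time order, and the output layer reads both.

```python
import numpy as np

rng = np.random.default_rng(4)
d_x, d_h, d_y = 3, 4, 2
fwd = [rng.normal(size=s) * 0.1 for s in [(d_h, d_x), (d_h, d_h)]]   # forward-layer weights
bwd = [rng.normal(size=s) * 0.1 for s in [(d_h, d_x), (d_h, d_h)]]   # backward-layer weights
Why = rng.normal(size=(d_y, 2 * d_h)) * 0.1                          # reads both directions

def run_direction(X, Wxh, Whh):
    """Plain one-directional recursion; shared by both halves of the BRNN."""
    h, H = np.zeros(d_h), []
    for x_t in X:
        h = np.tanh(Wxh @ x_t + Whh @ h)
        H.append(h)
    return np.array(H)

def brnn(X):
    Hf = run_direction(X, *fwd)                  # left-to-right pass
    Hb = run_direction(X[::-1], *bwd)[::-1]      # reuse the same code on the flipped sequence
    return np.array([Why @ np.concatenate([hf, hb]) for hf, hb in zip(Hf, Hb)])

print(brnn(rng.normal(size=(7, d_x))).shape)     # (7, 2)
```
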
