03/25/2024-03/31/2024
Self-Attention Does Not Need O(N^2) Memory
Key Idea
- Split Q, K, V along batch_dim, seq_dim, and head_dim, so that each query, key, and value is a 1-D vector of shape [feature_dim].
- For a single query, q @ k_i produces a scalar score s_i, and exp(s_i) * v_i produces a vector contribution to the output.
- Delay the softmax: accumulate running sums of exp(s_i) and exp(s_i) * v_i over all key/value vectors, and normalize only after the last one.
- Correspondingly, self-attention needs only [batch_dim, seq_dim, feature_dim] memory per head instead of [batch_dim, seq_dim, seq_dim]; when seq_dim >> feature_dim, this is a massive memory saving.
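A minimal NumPy sketch of this lazy-softmax accumulation for a single query and a single head; the function name, shapes, and the running-max rescaling (a standard numerical-stability trick) are illustrative details rather than the paper's exact code.

```python
import numpy as np

def memory_efficient_attention(q, K, V):
    """Attention output for one query vector q ([d]) against keys K and
    values V (each [n, d]), processed one key/value pair at a time so the
    [n, n] score matrix is never materialized."""
    num = np.zeros(V.shape[1])   # running sum of exp(s_i) * v_i
    den = 0.0                    # running sum of exp(s_i)
    m = -np.inf                  # running max score, for numerical stability
    for k_i, v_i in zip(K, V):
        s_i = float(q @ k_i)     # scalar score for this key
        m_new = max(m, s_i)
        rescale = np.exp(m - m_new) if np.isfinite(m) else 0.0
        num = num * rescale + np.exp(s_i - m_new) * v_i
        den = den * rescale + np.exp(s_i - m_new)
        m = m_new
    return num / den             # delayed softmax normalization
```

On random inputs this matches softmax(q @ K.T) @ V up to floating-point error; the paper's practical variant processes keys in chunks (roughly sqrt(n) at a time) rather than strictly one by one.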
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
State Space Model
- Discretization.
- Transforms the continuous parameters (Δ, A, B) into discrete parameters (Ā, B̄).
- Computation (see the sketch at the end of this section).
- Global convolution for parallelizable training.
- Linear recurrence for autoregressive inference.
- Linear Time Invariant (LTI)
- Model dynamics are constant throughout time.
- This constraint is the key limitation of prior SSMs.
- Maps a 1-D sequence into an implicit latent state.
- Successful models in this line of work
- Linear Attention -> H3 -> Hyena -> RetNet -> RWKV
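A toy NumPy sketch of the discretization and the two computation modes above, reduced to a scalar state and a single channel; the function names, the zero-order-hold formula written for a scalar A, and the example parameter values are assumptions for illustration.

```python
import numpy as np

def discretize(delta, A, B):
    """Zero-order-hold discretization for a scalar state:
    A_bar = exp(delta * A), B_bar = (exp(delta * A) - 1) / A * B."""
    A_bar = np.exp(delta * A)
    B_bar = (A_bar - 1.0) / A * B
    return A_bar, B_bar

def ssm_recurrent(u, A_bar, B_bar, C):
    """Linear recurrence h_t = A_bar * h_{t-1} + B_bar * u_t, y_t = C * h_t.
    Sequential; this is the mode used for autoregressive inference."""
    h, ys = 0.0, []
    for u_t in u:
        h = A_bar * h + B_bar * u_t
        ys.append(C * h)
    return np.array(ys)

def ssm_convolution(u, A_bar, B_bar, C):
    """The same map computed as a global causal convolution with kernel
    K = (C*B_bar, C*A_bar*B_bar, C*A_bar^2*B_bar, ...); parallelizable for training."""
    L = len(u)
    K = C * B_bar * A_bar ** np.arange(L)
    return np.convolve(u, K)[:L]

u = np.random.randn(16)
A_bar, B_bar = discretize(delta=0.1, A=-1.0, B=1.0)
# Both computation modes give the same output because the dynamics are LTI.
assert np.allclose(ssm_recurrent(u, A_bar, B_bar, C=0.5),
                   ssm_convolution(u, A_bar, B_bar, C=0.5))
```

The final assertion holds precisely because the model is LTI: a single fixed kernel summarizes the dynamics for every time step.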
Selective State Space Model
- Algorithm
- LTI models (computed as RNNs/CNNs) are limited by their constant, input-independent dynamics.
- -> To improve performance, the model should be selective about what is written into the state.
- -> Let the parameters that affect interactions along the sequence (Δ, B, C) be input-dependent.
- Implementation
- Materialize the expanded hidden state only in fast SRAM (hardware-aware scan), instead of writing it to HBM.
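A toy NumPy sketch of selection, again with a scalar state per channel; the input-dependent Δ, B, C are produced here by hypothetical scalar weights (W_delta, W_B, W_C) standing in for the paper's linear projections, and the hardware-aware SRAM kernel itself is not modeled.

```python
import numpy as np

def selective_ssm(u, A, W_delta, W_B, W_C):
    """Selective SSM over a 1-D input u: delta_t, B_t, C_t depend on u_t,
    so the dynamics change every step, the global-convolution trick no
    longer applies, and the model must be computed as a recurrence (scan)."""
    h, ys = 0.0, []
    for u_t in u:
        # Selection: parameters are (toy) functions of the current input.
        delta_t = np.log1p(np.exp(W_delta * u_t))  # softplus keeps the step size positive
        B_t = W_B * u_t
        C_t = W_C * u_t
        # Discretize with the input-dependent step size, then recur.
        A_bar = np.exp(delta_t * A)
        B_bar = (A_bar - 1.0) / A * B_t
        h = A_bar * h + B_bar * u_t
        ys.append(C_t * h)
    return np.array(ys)

y = selective_ssm(np.random.randn(16), A=-1.0, W_delta=0.5, W_B=1.0, W_C=1.0)
```

Because Ā_t and B̄_t now change at every step, no fixed convolution kernel exists and the model must be computed as a scan; the paper's kernel fuses this scan and keeps the expanded hidden state in SRAM so it is never written out to HBM.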