Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
Infini-attention
- Maintains a compressive memory in parallel with standard multi-head attention, storing the K, V states of previous segments in fixed-size form.
- Scaled Dot-Product Attention
- Compressive Memory
- Parameterize the memory as an associative matrix; cast memory update and retrieval as a linear attention mechanism.
- Memory Retrieval
- Retrieve with normalized linear attention against the memory matrix: σ(Q)M, divided by the normalization term σ(Q)z (z is a running sum of σ(K)), where σ is a nonlinear activation (ELU + 1) applied to Q.
- Memory Updates
- Add the current segment's KV bindings to the memory, M ← M + σ(K)ᵀV, with the same activation applied to K.
- Apply the delta rule, first subtracting what the memory already retrieves for these keys, so the matrix is left unmodified if a KV binding already exists.
- Linearly combine the memory readout and local dot-product attention with a learned gating scalar (see the sketch after this list).
- Compares compute and memory overhead against related work:
- Transformer-XL
- Compressive Transformer
- Memorizing Transformers
- RMT
- AutoCompressors
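A minimal single-head sketch of the compressive-memory path, assuming σ = ELU + 1 as in the paper; the function names, shapes, and scalar gate are simplifications of my own:

```python
import torch
import torch.nn.functional as F

def elu_plus_one(x):
    # sigma(x) = ELU(x) + 1 keeps activations positive for linear attention.
    return F.elu(x) + 1.0

def retrieve(M, z, Q):
    # A_mem = (sigma(Q) M) / (sigma(Q) z): normalized linear-attention readout.
    sq = elu_plus_one(Q)                                  # [L, d_k]
    return (sq @ M) / (sq @ z).clamp(min=1e-6)            # [L, d_v]

def update(M, z, K, V, use_delta_rule=True):
    sk = elu_plus_one(K)                                  # [L, d_k]
    if use_delta_rule:
        # Subtract what the memory already returns for these keys, so an
        # existing KV binding leaves the matrix (almost) unmodified.
        V = V - (sk @ M) / (sk @ z).clamp(min=1e-6)
    M = M + sk.transpose(0, 1) @ V                        # [d_k, d_v]
    z = z + sk.sum(dim=0, keepdim=True).transpose(0, 1)   # [d_k, 1]
    return M, z

def infini_attention_segment(Q, K, V, M, z, beta):
    # Local causal dot-product attention within the current segment.
    L, d_k = Q.shape
    scores = (Q @ K.transpose(0, 1)) / d_k ** 0.5
    causal = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
    A_dot = torch.softmax(scores.masked_fill(causal, float("-inf")), dim=-1) @ V
    # Long-term readout from the compressive memory of previous segments.
    A_mem = retrieve(M, z, Q)
    # Mix memory readout and local attention with a learned scalar gate.
    g = torch.sigmoid(beta)
    out = g * A_mem + (1.0 - g) * A_dot
    # Write this segment's KV bindings into memory for later segments.
    M, z = update(M, z, K, V)
    return out, M, z
```

Note that retrieval reads the memory accumulated from previous segments before the current segment's KV pairs are written in.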
Implementation
- https://github.com/Beomi/InfiniTransformer
- Long input is broken into segments and processed in order, with the compressive memory carried over between segments (as sketched below).
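To make the carryover concrete, a toy driver loop using the hypothetical `infini_attention_segment` from the sketch above (this is not the repo's API):

```python
import torch

# Toy segment loop: split a long sequence into segments and carry the
# compressive memory (M, z) across them. Shapes and names are illustrative.
d_k, d_v, seg_len = 64, 64, 128
x_q = torch.randn(1024, d_k)   # stand-ins for the per-token Q/K/V projections
x_k = torch.randn(1024, d_k)
x_v = torch.randn(1024, d_v)

M = torch.zeros(d_k, d_v)      # compressive memory matrix
z = torch.zeros(d_k, 1)        # normalization term
beta = torch.zeros(1)          # gate parameter (learned in a real model)

outputs = []
for s in range(0, x_q.shape[0], seg_len):
    Q, K, V = x_q[s:s + seg_len], x_k[s:s + seg_len], x_v[s:s + seg_len]
    out, M, z = infini_attention_segment(Q, K, V, M, z, beta)
    outputs.append(out)
y = torch.cat(outputs, dim=0)  # [1024, d_v], full-sequence output
```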
MEGALODON: Efficient LLM Pretraining and Inference with Unlimited Context Length
Moving Average Equipped Gated Attention
- Multi-dimensional Damped EMA
- Expand each input dimension into h dimensions, then apply a damped moving average.
- Maintain a damped moving-average state per expanded dimension (see the sketch after this list).
- Project the state back to the original space.
- Moving Average Equipped Gated Attention
- The EMA output serves as a shared representation from which the queries and keys are computed.
- Split the sequence (and its Q, K, V) into fixed-size chunks and attend within each chunk, giving linear complexity in sequence length.
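A sequential, real-valued sketch of the multi-dimensional damped EMA; the parameter names follow the paper's α, δ, β, η, but the loop form is mine (real implementations typically compute this recurrence as a convolution):

```python
import torch

def multidim_damped_ema(x, alpha, delta, beta, eta):
    """Sketch of a multi-dimensional damped EMA (real-valued).

    x:     [T, d]  input sequence
    alpha: [d, h]  EMA weights in (0, 1)
    delta: [d, h]  damping factors in (0, 1)
    beta:  [d, h]  expands each scalar feature into h dimensions
    eta:   [d, h]  projects the h-dimensional state back to the original space
    """
    T, d = x.shape
    state = torch.zeros_like(alpha)            # per-feature hidden state, [d, h]
    ys = []
    for t in range(T):
        u = x[t].unsqueeze(-1) * beta          # expand: [d, h]
        # Damped moving average: decay the previous state, mix in the new input.
        state = alpha * u + (1.0 - alpha * delta) * state
        ys.append((state * eta).sum(dim=-1))   # project back: [d]
    return torch.stack(ys)                     # [T, d]
```

MEGALODON's CEMA keeps this recurrence but makes the decay complex-valued, so the state rotates as it decays.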
Innovations of MEGALODON
- Extending Multi-dimensional Damped EMA to the Complex Domain (CEMA)
- The decay factor becomes complex-valued, so the hidden state rotates as it decays.
- Timestep Normalization
- Layer Normalization: Along feature dimension.
- Group Normalization: Along a group of feature dimensions together with the sequence dimension, which leaks future information in autoregressive models.
- Timestep Normalization: Group normalization whose statistics are accumulated cumulatively along the sequence dimension, keeping it causal (sketched after this list).
- Normalized Attention in MEGALODON
- Q and K are computed from a normalized version of the shared EMA representation, rather than from the raw shared representation, which improves stability.
- Pre-Norm with Two-hop Residual
- Pre-normalization trains more stably than post-normalization, but can still become unstable at large model scale.
- Rearrange the residual connections so the FFN's residual re-uses the block input (the same input the attention sublayer sees) instead of the attention output; a minimal sketch appears at the end of this section.
- 4D Parallelism Training
- Adds chunk-wise (timestep) parallelism on top of data/tensor/pipeline parallelism; devices holding consecutive chunks exchange the EMA and timestep-normalization states.
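A sketch of the timestep-normalization idea, assuming a single feature group and omitting the learned gain/bias; the cumulative-statistics trick is the point, not the exact parameterization:

```python
import torch

def timestep_norm(x, eps=1e-5):
    """Group-norm-like statistics computed cumulatively along the sequence,
    so position t only sees positions <= t (causal).

    x: [T, d]  one feature group of a sequence
    """
    T, d = x.shape
    # Cumulative element counts, sums, and squared sums up to each timestep.
    counts = torch.arange(1, T + 1, dtype=x.dtype).unsqueeze(-1) * d  # [T, 1]
    csum   = x.cumsum(dim=0).sum(dim=-1, keepdim=True)                # [T, 1]
    csum2  = (x * x).cumsum(dim=0).sum(dim=-1, keepdim=True)          # [T, 1]
    mean = csum / counts
    var  = csum2 / counts - mean ** 2
    return (x - mean) / torch.sqrt(var + eps)
```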
Implementation
- https://github.com/XuezheMax/megalodon
- 4D parallelism
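To make the two-hop residual arrangement concrete, a plain-PyTorch sketch in which a generic attention module stands in for Megalodon's gated attention (layer names and sizes are mine):

```python
import torch
import torch.nn as nn

class TwoHopPreNormBlock(nn.Module):
    """Pre-norm block where the FFN's residual re-uses the block input x
    instead of the attention output (the residual spans two sublayers)."""

    def __init__(self, d_model, n_heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        h = self.norm1(x)
        # First hop: standard pre-norm attention residual.
        a = x + self.attn(h, h, h, need_weights=False)[0]
        # Second hop: the FFN residual connects back to the block input x,
        # not to the attention output a.
        return x + self.ffn(self.norm2(a))
```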