Reading Notes – 7

Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

Infini-attention

  • Maintain a compressive memory in parallel with multi-head attention, compressing and storing previous K, V states instead of discarding them.
    • Scaled Dot-Product Attention
    • Compressive Memory
      • Parameterize the memory as an associative matrix; formulate memory update and retrieval as a linear attention mechanism (sketched under Implementation below).
      • Memory Retrieval
        • Normalized attention between Q and the memory matrix M, with a nonlinear activation (ELU + 1) applied to Q
      • Memory Updates
        • Adds new K–V bindings to the memory matrix, with the same activation applied to K
        • Applies the delta rule, leaving the matrix unmodified if a K–V binding already exists (only the difference between V and the retrieved value is written)
    • Linearly combine the local attention output and the memory-retrieved output with a learned gating parameter
  • Compared compute and memory overhead against related works:
    • Transformer-XL
    • Compressive Transformer
    • Memorizing Transformers
    • RMT
    • AutoCompressors

Implementation
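
A minimal single-head sketch of the compressive-memory path in PyTorch, assuming ELU + 1 as the activation σ (as in the paper) and a scalar learned gate β per head; the function names, the small clamp, and the segment loop are my own illustration, not the authors' code.

```python
# A minimal single-head sketch of Infini-attention's compressive memory path.
# Assumptions (mine, not the paper's code): ELU + 1 as the activation sigma,
# a scalar learned gate beta per head, and a small clamp to avoid division by zero.
import torch
import torch.nn.functional as F

def sigma(x):
    # ELU + 1 keeps activations positive, enabling the linear-attention form.
    return F.elu(x) + 1.0

def retrieve(M, z, Q):
    """Memory retrieval: A_mem = sigma(Q) M / (sigma(Q) z)."""
    sQ = sigma(Q)                                              # (T, d_k)
    return (sQ @ M) / (sQ @ z).clamp(min=1e-6).unsqueeze(-1)   # (T, d_v)

def update(M, z, K, V):
    """Delta-rule update: write only the part of V not already stored in M."""
    sK = sigma(K)                                              # (T, d_k)
    V_old = (sK @ M) / (sK @ z).clamp(min=1e-6).unsqueeze(-1)  # what M would return for K
    M_new = M + sK.T @ (V - V_old)                             # (d_k, d_v)
    z_new = z + sK.sum(dim=0)                                  # (d_k,) normalization term
    return M_new, z_new

def infini_attention_segment(Q, K, V, M, z, beta):
    """One segment: gate between memory retrieval and local dot-product attention."""
    A_mem = retrieve(M, z, Q)
    A_dot = F.scaled_dot_product_attention(
        Q.unsqueeze(0), K.unsqueeze(0), V.unsqueeze(0), is_causal=True
    ).squeeze(0)
    g = torch.sigmoid(beta)                                    # learned scalar gate
    out = g * A_mem + (1 - g) * A_dot
    M, z = update(M, z, K, V)                                  # memory carries to next segment
    return out, M, z

# Usage: stream segments of a long sequence through a fixed-size memory.
d_k, d_v, seg = 64, 64, 128
M, z, beta = torch.zeros(d_k, d_v), torch.zeros(d_k), torch.zeros(())
for _ in range(4):                                             # four consecutive segments
    Q, K, V = (torch.randn(seg, d_k) for _ in range(3))
    out, M, z = infini_attention_segment(Q, K, V, M, z, beta)
print(out.shape)                                               # torch.Size([128, 64])
```

However many segments are consumed, the per-head memory stays a single d_k × d_v matrix plus a d_k vector, which is what the constant-memory "infinite context" claim rests on.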

MEGALODON: Efficient LLM Pretraining and Inference with Unlimited Context Length

Moving Average Equipped Gated Attention

  • Multi-dimensional Damped EMA
    • Project the sequence into h dimensions and perform a damped moving average (see the sketch after this section).
    • Maintain a damped moving-average state.
    • Project sequence back to the original space.
  • Moving Average Equipped Gated Attention
    • EMA is used to compute the shared representation.
    • Split Q, K, V into fixed-size chunks and attend within each chunk, keeping attention cost linear in sequence length.
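
For reference, a minimal (and deliberately slow) sketch of the multi-dimensional damped EMA in PyTorch; parameter names follow the MEGA formulation (α, δ, β, η), and the explicit time loop stands in for the convolution/blockwise computation used in practice.

```python
# A minimal, deliberately unoptimized sketch of the multi-dimensional damped EMA.
import torch

def multidim_damped_ema(x, alpha, delta, beta, eta):
    """
    x:     (T, d)  input sequence
    alpha: (d, h)  EMA weights in (0, 1)
    delta: (d, h)  damping factors in (0, 1)
    beta:  (d, h)  expands each scalar feature into an h-dimensional state input
    eta:   (d, h)  projects the h-dimensional state back to one scalar per feature
    returns (T, d)
    """
    T, d = x.shape
    state = torch.zeros(d, alpha.shape[1])               # damped moving-average state
    ys = []
    for t in range(T):
        u = beta * x[t].unsqueeze(-1)                    # (d, h): project x_t into h dims
        state = alpha * u + (1 - alpha * delta) * state  # damped EMA update
        ys.append((eta * state).sum(-1))                 # (d,): project back to input space
    return torch.stack(ys)

# Usage with random parameters squashed into (0, 1) where required.
T, d, h = 16, 8, 4
x = torch.randn(T, d)
alpha, delta = torch.rand(d, h), torch.rand(d, h)
beta, eta = torch.randn(d, h), torch.randn(d, h)
print(multidim_damped_ema(x, alpha, delta, beta, eta).shape)   # torch.Size([16, 8])
```

MEGALODON's CEMA keeps this recurrence but makes the decay term complex-valued (a damped rotation) and takes the real part of the output projection.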

Innovations of MEGALODON

  • Extending Multi-dimensional Damped EMA to Complex Domain
    • Extend the decay factor to the complex domain (a complex-valued damped rotation), preserving the decaying behavior while increasing expressiveness.
  • Timestep Normalization
    • Layer Normalization: Along feature dimension.
    • Group Normalization: Along a group of feature dimensions × the sequence dimension.
    • Timestep Normalization: Group normalization computed with cumulative statistics along the sequence, so it remains autoregressive (a sketch appears under Implementation below).
  • Normalized Attention in MEGALODON
    • Q and K are computed from a normalized version of the shared representation rather than the raw one.
  • Pre-Norm with Two-hop Residual
    • Pre-normalization is more stable than post-normalization
    • Rearrange the residual connections so that the pre-FFN and pre-next-attention positions share the same residual stream.
  • 4D Parallelism Training
    • Adds chunk-wise parallelism along the sequence dimension on top of data, tensor, and pipeline parallelism; chunk-level hidden states are exchanged between devices.

Implementation
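
A minimal sketch of the timestep-normalization idea, assuming the statistics are taken over the full feature dimension rather than the paper's feature groups, and omitting the learnable scale and offset.

```python
# A minimal sketch of timestep normalization: group-norm-style statistics computed
# cumulatively along the sequence, so position t only uses information from positions <= t.
import torch

def timestep_norm(x, eps=1e-5):
    """x: (T, d) -> (T, d), normalized with cumulative mean/variance over timesteps."""
    T, d = x.shape
    counts = torch.arange(1, T + 1, dtype=x.dtype).unsqueeze(-1) * d  # elements seen so far
    csum = x.sum(dim=-1, keepdim=True).cumsum(dim=0)                  # cumulative sum
    csum_sq = (x * x).sum(dim=-1, keepdim=True).cumsum(dim=0)         # cumulative sum of squares
    mean = csum / counts
    var = (csum_sq / counts - mean ** 2).clamp(min=0.0)
    return (x - mean) / torch.sqrt(var + eps)

x = torch.randn(32, 16)
print(timestep_norm(x).shape)   # torch.Size([32, 16])
```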
