Reading Notes – 7

Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

Infini-attention

  • Maintain a compressive memory in parallel with multi-head attention, compressing and storing previous K, V states instead of discarding them.
    • Scaled Dot-Product Attention
    • Compressive Memory
      • Parameterize the memory as an associative matrix; formulate memory update and retrieval as a linear attention mechanism (sketched under Implementation below).
      • Memory Retrieval
        • Normalized attention between Q and the memory matrix M, with a nonlinear activation (ELU + 1) applied to Q
      • Memory Updates
        • Adds new K–V bindings to the memory matrix, with the same activation applied to K
        • Applies the delta rule, leaving the matrix unmodified if a K–V binding already exists (only the difference between V and the retrieved value is written)
    • Linearly combine the local attention output and the memory-retrieved output with a learned gating parameter
  • Compared compute and memory overhead against related works:
    • Transformer-XL
    • Compressive Transformer
    • Memorizing Transformers
    • RMT
    • AutoCompressors

Implementation
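
A minimal single-head sketch of the compressive-memory path in PyTorch, assuming ELU + 1 as the activation σ (as in the paper) and a scalar learned gate β per head; the function names, the small clamp, and the segment loop are my own illustration, not the authors' code.

```python
# A minimal single-head sketch of Infini-attention's compressive memory path.
# Assumptions (mine, not the paper's code): ELU + 1 as the activation sigma,
# a scalar learned gate beta per head, and a small clamp to avoid division by zero.
import torch
import torch.nn.functional as F

def sigma(x):
    # ELU + 1 keeps activations positive, enabling the linear-attention form.
    return F.elu(x) + 1.0

def retrieve(M, z, Q):
    """Memory retrieval: A_mem = sigma(Q) M / (sigma(Q) z)."""
    sQ = sigma(Q)                                              # (T, d_k)
    return (sQ @ M) / (sQ @ z).clamp(min=1e-6).unsqueeze(-1)   # (T, d_v)

def update(M, z, K, V):
    """Delta-rule update: write only the part of V not already stored in M."""
    sK = sigma(K)                                              # (T, d_k)
    V_old = (sK @ M) / (sK @ z).clamp(min=1e-6).unsqueeze(-1)  # what M would return for K
    M_new = M + sK.T @ (V - V_old)                             # (d_k, d_v)
    z_new = z + sK.sum(dim=0)                                  # (d_k,) normalization term
    return M_new, z_new

def infini_attention_segment(Q, K, V, M, z, beta):
    """One segment: gate between memory retrieval and local dot-product attention."""
    A_mem = retrieve(M, z, Q)
    A_dot = F.scaled_dot_product_attention(
        Q.unsqueeze(0), K.unsqueeze(0), V.unsqueeze(0), is_causal=True
    ).squeeze(0)
    g = torch.sigmoid(beta)                                    # learned scalar gate
    out = g * A_mem + (1 - g) * A_dot
    M, z = update(M, z, K, V)                                  # memory carries to next segment
    return out, M, z

# Usage: stream segments of a long sequence through a fixed-size memory.
d_k, d_v, seg = 64, 64, 128
M, z, beta = torch.zeros(d_k, d_v), torch.zeros(d_k), torch.zeros(())
for _ in range(4):                                             # four consecutive segments
    Q, K, V = (torch.randn(seg, d_k) for _ in range(3))
    out, M, z = infini_attention_segment(Q, K, V, M, z, beta)
print(out.shape)                                               # torch.Size([128, 64])
```

However many segments are consumed, the per-head memory stays a single d_k × d_v matrix plus a d_k vector, which is what the constant-memory "infinite context" claim rests on.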

MEGALODON: Efficient LLM Pretraining and Inference with Unlimited Context Length

Moving Average Equipped Gated Attention

  • Multi-dimensional Damped EMA
    • Project the sequence into h dimensions and perform a damped moving average (see the sketch after this section).
    • Maintain a damped moving-average state.
    • Project sequence back to the original space.
  • Moving Average Equipped Gated Attention
    • EMA is used to compute the shared representation.
    • Split Q, K, V into fixed-size chunks and attend within each chunk, keeping attention cost linear in sequence length.
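
For reference, a minimal (and deliberately slow) sketch of the multi-dimensional damped EMA in PyTorch; parameter names follow the MEGA formulation (α, δ, β, η), and the explicit time loop stands in for the convolution/blockwise computation used in practice.

```python
# A minimal, deliberately unoptimized sketch of the multi-dimensional damped EMA.
import torch

def multidim_damped_ema(x, alpha, delta, beta, eta):
    """
    x:     (T, d)  input sequence
    alpha: (d, h)  EMA weights in (0, 1)
    delta: (d, h)  damping factors in (0, 1)
    beta:  (d, h)  expands each scalar feature into an h-dimensional state input
    eta:   (d, h)  projects the h-dimensional state back to one scalar per feature
    returns (T, d)
    """
    T, d = x.shape
    state = torch.zeros(d, alpha.shape[1])               # damped moving-average state
    ys = []
    for t in range(T):
        u = beta * x[t].unsqueeze(-1)                    # (d, h): project x_t into h dims
        state = alpha * u + (1 - alpha * delta) * state  # damped EMA update
        ys.append((eta * state).sum(-1))                 # (d,): project back to input space
    return torch.stack(ys)

# Usage with random parameters squashed into (0, 1) where required.
T, d, h = 16, 8, 4
x = torch.randn(T, d)
alpha, delta = torch.rand(d, h), torch.rand(d, h)
beta, eta = torch.randn(d, h), torch.randn(d, h)
print(multidim_damped_ema(x, alpha, delta, beta, eta).shape)   # torch.Size([16, 8])
```

MEGALODON's CEMA keeps this recurrence but makes the decay term complex-valued (a damped rotation) and takes the real part of the output projection.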

Innovations of MEGALODON

  • Extending Multi-dimensional Damped EMA to Complex Domain
    • Extend the decay factor to the complex domain (a complex-valued damped rotation), preserving the decaying behavior while increasing expressiveness.
  • Timestep Normalization
    • Layer Normalization: Along feature dimension.
    • Group Normalization: Along a group of feature dimensions × the sequence dimension.
    • Timestep Normalization: Group normalization computed with cumulative statistics along the sequence, so it remains autoregressive (a sketch appears under Implementation below).
  • Normalized Attention in MEGALODON
    • Q and K are computed from a normalized version of the shared representation rather than the raw one.
  • Pre-Norm with Two-hop Residual
    • Pre-normalization is more stable than post-normalization
    • Rearrange the residual connections so that the pre-FFN and pre-next-attention positions share the same residual stream.
  • 4D Parallelism Training
    • Adds chunk-wise parallelism along the sequence dimension on top of data, tensor, and pipeline parallelism; chunk-level hidden states are exchanged between devices.

Implementation
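
A minimal sketch of the timestep-normalization idea, assuming the statistics are taken over the full feature dimension rather than the paper's feature groups, and omitting the learnable scale and offset.

```python
# A minimal sketch of timestep normalization: group-norm-style statistics computed
# cumulatively along the sequence, so position t only uses information from positions <= t.
import torch

def timestep_norm(x, eps=1e-5):
    """x: (T, d) -> (T, d), normalized with cumulative mean/variance over timesteps."""
    T, d = x.shape
    counts = torch.arange(1, T + 1, dtype=x.dtype).unsqueeze(-1) * d  # elements seen so far
    csum = x.sum(dim=-1, keepdim=True).cumsum(dim=0)                  # cumulative sum
    csum_sq = (x * x).sum(dim=-1, keepdim=True).cumsum(dim=0)         # cumulative sum of squares
    mean = csum / counts
    var = (csum_sq / counts - mean ** 2).clamp(min=0.0)
    return (x - mean) / torch.sqrt(var + eps)

x = torch.randn(32, 16)
print(timestep_norm(x).shape)   # torch.Size([32, 16])
```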
