03/04/2024 – 03/10/2024
LLaMA: Open and Efficient Foundation Language Models
Insights
Key Idea
- Training smaller models on more tokens costs more at training time but is cheaper at inference time.
Training
- Data: Open-source data including CommonCrawl, GitHub, Wikipedia, Books, ArXiv, StackExchange.
- Tokenizer: BPE; the full training set contains ~1.4T tokens.
- Optimizer: AdamW with a cosine learning-rate schedule.
Architecture
- Standard transformer with the following modifications:
- Pre-normalization with RMSNorm (as in GPT-3); sketch after this list.
- SwiGLU activation (as in PaLM).
- Rotary embeddings (as in GPT-Neo).
- Efficient implementation
- Efficient causal MHA
- Does not store the attention weights.
- Does not compute the key/query scores that are masked out.
- Activation checkpointing
- Reduces the amount of activations recomputed during the backward pass.
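A minimal sketch of RMSNorm-based pre-normalization as noted above (PyTorch; the eps value and the residual wiring in the comment are illustrative, not LLaMA's exact implementation):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: normalize by the RMS of the features
    (no mean subtraction), then apply a learnable per-feature gain."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

# Pre-normalization: normalize the *input* of each sub-layer instead of its output.
# h = x + Attention(RMSNorm(x));  out = h + FFN(RMSNorm(h))
```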
FP8 Formats for Deep Learning
General
Requires a wide exponent range to handle overflows/underflows.
Requires scaling.
Format
- E4M3 – deviates from IEEE-754 special-value conventions to extend dynamic range
- Forward pass: weights and activations.
- Does not represent infinities.
- Represents positive/negative zero and NaN.
- Keeps the binary encoding symmetric.
- E5M2 – follows IEEE-754 conventions for special values
- Backward pass: gradients.
- Represents infinities, positive/negative zero, and NaN.
- Exponent bias
- Handled by software-level shifting (scaling) rather than a different hardware exponent bias.
- Per-tensor scaling factors (see the scaling sketch after this list).
- Subnormal numbers:
- Exponent field of all zeros with no implicit leading 1 in the mantissa (non-zero significand field).
- Exist to allow gradual underflow, avoiding cases where subtracting two nearby floats yields exactly zero and triggers divide-by-zero exceptions.
- The NVIDIA H100 whitepaper does not state whether FP8 subnormal numbers are handled.
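A small NumPy sketch of per-tensor scaling into the FP8 dynamic range (448 and 57344 are the largest finite magnitudes of E4M3 and E5M2; the cast itself is only simulated by clipping, mantissa rounding is not modeled):

```python
import numpy as np

# Largest finite magnitudes of the two FP8 formats.
FP8_MAX = {"e4m3": 448.0, "e5m2": 57344.0}

def quantize_per_tensor(x: np.ndarray, fmt: str = "e4m3"):
    """Scale a tensor so its max magnitude lands at the top of the FP8 range,
    then simulate the FP8 cast by clipping (mantissa rounding not modeled)."""
    amax = float(np.abs(x).max())
    scale = FP8_MAX[fmt] / max(amax, 1e-12)      # per-tensor scaling factor
    x_scaled = np.clip(x * scale, -FP8_MAX[fmt], FP8_MAX[fmt])
    return x_scaled, scale

def dequantize(x_scaled: np.ndarray, scale: float) -> np.ndarray:
    return x_scaled / scale

grads = (np.random.randn(4, 4) * 1e-3).astype(np.float32)
q, s = quantize_per_tensor(grads, "e4m3")
print(np.allclose(dequantize(q, s), grads))       # True (rounding error not simulated)
```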
FP8-LM: Training FP8 Large Language Models
Insights
- Trains with low-precision data formats without compromising model accuracy and without changing hyper-parameters.
- Applied to both pre-training and supervised fine-tuning.
Key Idea
Progressively apply FP8 to:
- Gradients (collective communication)
- Optimizer states
- Distributed parallel training
Innovations
- Precision decoupling
- Automatic scaling
FP8 LLMs
FP8 Gradient and All-Reduce Communication
Gradient
- Pre-scaling: divide each gradient by the number of GPUs before aggregation (risks FP8 underflow).
- Post-scaling: divide by the number of GPUs after aggregation (risks overflow during the sum).
- Auto-scaling: adjust a per-tensor scaling factor by 0.5x or 2x depending on how the maximum gradient value compares to a threshold (sketched after this list).
All-Reduce communication
- Reduce the per-GPU scaling factors to a single global minimum scaling factor s and use it in the reduction; the aggregated gradient is recovered as g/s.
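A single-process sketch of the auto-scaling rule and scaled all-reduce described above (function names and the 2-GPU toy setup are made up for illustration):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # overflow threshold for the scaled gradients

def auto_scale(scale: float, scaled_amax: float) -> float:
    """Adjust a scaling factor by 0.5x / 2x based on the scaled max value
    versus the FP8 threshold (the rule described in the notes)."""
    if scaled_amax >= FP8_E4M3_MAX:
        return scale * 0.5
    if scaled_amax < FP8_E4M3_MAX / 2:
        return scale * 2.0
    return scale

print(auto_scale(1024.0, scaled_amax=500.0))   # 512.0: halved to avoid overflow

def fp8_all_reduce(scaled_grads, scales, n_gpus):
    """Single-process stand-in for the FP8 all-reduce:
    1) reduce the per-GPU scaling factors to the global minimum s,
    2) re-express every scaled gradient in terms of that common s,
    3) sum, then divide by s (dequantize, i.e. g/s) and by n_gpus (post-scaling)."""
    s = min(scales)
    rescaled = [g * (s / si) for g, si in zip(scaled_grads, scales)]
    total = np.sum(rescaled, axis=0)
    return total / (s * n_gpus)

# Toy example with two "GPUs".
g0, g1 = np.array([1e-3, 2e-3]), np.array([3e-3, 4e-3])
s0 = FP8_E4M3_MAX / np.abs(g0).max()
s1 = FP8_E4M3_MAX / np.abs(g1).max()
print(fp8_all_reduce([g0 * s0, g1 * s1], [s0, s1], n_gpus=2))  # ≈ [0.002, 0.003]
```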
FP8 Optimizer
The baseline mixed-precision Adam setup needs 16 bytes per parameter:
- 4 (master weights) + 4 (gradients) + (4 + 4) (Adam states) = 16 bytes.
Precision decoupling: The gradient statistics can use lower precision but the master weights require high precision.
- The first-order gradient moment can tolerate a high quantization error.
- The second-order gradient moment cannot:
- squaring small gradient values produces tiny numbers that underflow in FP8.
Master weights are stored in BF16 with dynamic scaling.
This paper's recipe needs 6 bytes per parameter:
- 2 (master weights) + 1 (gradients) + (1 + 2) (Adam states) = 6 bytes (back-of-the-envelope check below).
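A quick back-of-the-envelope check of the per-parameter memory (byte counts from the notes above; the 175B parameter count is only an illustrative model size):

```python
# Per-parameter optimizer-related memory, baseline mixed-precision Adam vs. the FP8-LM layout.
BASELINE_BYTES = 4 + 4 + 4 + 4   # master weights + gradients + Adam m, v
FP8LM_BYTES    = 2 + 1 + 1 + 2   # 2-byte master weights + FP8 grads + FP8 m + 2-byte v

params = 175e9                    # illustrative GPT-175B-sized model
print(params * BASELINE_BYTES / 2**30, "GiB")  # ≈ 2608 GiB
print(params * FP8LM_BYTES    / 2**30, "GiB")  # ≈  978 GiB
```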
FP8 Distributed Parallel Training
- Data parallelism and pipeline parallelism are unaffected (no changes needed).
- Tensor Parallelism
- Distributes each tensor in its entirety across devices while taking the per-tensor scaling factor into account.
- Open question from these notes: is this scalable for very large models and tensors?
Evaluation
Evaluated on GPT-style models trained on 40B tokens.
Sequence Parallelism: Long Sequence Training from System Perspective
Insights
Innovations
- Splits long sequences into multiple chunks and feeds them to different devices.
- Ring Self-Attention (RSA): circulates key and value embeddings among devices in a ring manner.
- Composable with data, tensor, and pipeline parallelism to form 4D parallelism.
- Also applicable to shorter-sequence modeling and sparse attention.
Sequence Parallelism
- Limited to bidirectional self-attention.
- Design (toy ring simulation at the end of this section)
- 2-stage:
- Stage 1: transmit key embeddings among devices to calculate attention scores.
- Performed N − 1 times so the local query slice can compute QK^T against all local/remote key slices.
- The final result is a slice of QK^T along the sequence_q dimension.
- Stage 2: transmit value embeddings among devices to calculate the attention output.
- Performed N − 1 times so the local score slice can compute SV against all local/remote value slices.
- The final result is a slice of SV along the sequence_q dimension.
- Modeling
- Memory usage
- For both MLP and MHA, it is more memory-efficient than tensor parallelism if B·L (batch size × sequence length) is large enough.
- Communication cost
- Sequence parallelism
- FWD: 2 ring-style P2P communication
- BWD: 2 all-reduce collective communication and 2 ring-style P2P communication
- Tensor parallelism
- FWD/BWD: 4 collective communication
- Sequence parallelism has the same communication overhead as tensor parallelism, but better compatibility with pipeline parallelism since it does not need an all-gather across pipeline stages.
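A toy single-process simulation of the two-stage RSA pattern above (plain loops stand in for the N − 1 ring exchanges of K and V; the 1/sqrt(d) scaling is omitted on both sides of the check):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ring_self_attention(Q_chunks, K_chunks, V_chunks):
    """Device i owns Q_i, K_i, V_i (slices along the sequence dimension).
    Stage 1: K_j slices arrive over N-1 ring steps; device i builds its
             row block of scores softmax(Q_i K^T).
    Stage 2: V_j slices arrive over N-1 ring steps; device i accumulates
             the matching column block of its scores times V_j."""
    n = len(Q_chunks)
    outputs = []
    for i in range(n):                                   # "device" i
        # Stage 1 (a plain loop stands in for the ring exchange of K).
        scores = np.concatenate(
            [Q_chunks[i] @ K_chunks[j].T for j in range(n)], axis=-1)
        S_i = softmax(scores)                            # slice of softmax(QK^T) along seq_q
        # Stage 2 (ring exchange of V): block-wise S_i @ V.
        chunk = K_chunks[0].shape[0]
        out_i = sum(S_i[:, j * chunk:(j + 1) * chunk] @ V_chunks[j] for j in range(n))
        outputs.append(out_i)
    return np.concatenate(outputs, axis=0)               # full attention output

# Sanity check against single-device attention on a tiny example.
L, d, n = 8, 4, 2
Q, K, V = (np.random.randn(L, d) for _ in range(3))
ref = softmax(Q @ K.T) @ V
rsa = ring_self_attention(np.split(Q, n), np.split(K, n), np.split(V, n))
print(np.allclose(ref, rsa))  # True
```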
Rethinking Attention With Performers
- Computes the product of (kernelized) K and V first instead of forming QK^T, which avoids the L × L attention matrix and gives linear complexity in sequence length.
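A minimal sketch of that associativity trick, with a very simplified positive random-feature map standing in for the paper's FAVOR+ mechanism (the feature dimension r and the scaling constants are illustrative):

```python
import numpy as np

def random_features(X, W):
    """Simplified positive feature map phi(x) ≈ exp(Wx − |x|^2 / 2);
    normalization constants are omitted (they cancel in num/den below)."""
    return np.exp(X @ W.T - 0.5 * np.square(X).sum(-1, keepdims=True))

L, d, r = 1024, 64, 128
Q = np.random.randn(L, d) / d**0.25   # split the usual 1/sqrt(d) between Q and K
K = np.random.randn(L, d) / d**0.25
V = np.random.randn(L, d)
W = np.random.randn(r, d)             # random projection defining the feature map

phi_Q, phi_K = random_features(Q, W), random_features(K, W)   # (L, r)

# Linear attention: compute phi(K)^T V first -> (r, d), then phi(Q) @ that.
num = phi_Q @ (phi_K.T @ V)                        # O(L * r * d), no L x L matrix
den = phi_Q @ phi_K.sum(axis=0, keepdims=True).T   # row-wise normalizer
approx = num / den
print(approx.shape)  # (1024, 64)
```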