03/04/2024 – 03/10/2024
LLaMA: Open and Efficient Foundation Language Models
Insights
Key Idea
- Training smaller models on more tokens costs more at training time but is cheaper at inference time.
Training
- Data: Open-source data including CommonCrawl, GitHub, Wikipedia, Books, ArXiv, StackExchange.
- Tokenizer: BPE; the full training set contains ~1.4T tokens.
- Optimizer: AdamW with a cosine learning-rate schedule.
Architecture
- Standard transformer with the following modifications:
- Pre-normalization with RMSNorm (as in GPT-3); sketch after this list.
- SwiGLU activation (as in PaLM).
- Rotary embeddings (as in GPT-Neo).
- Efficient implementation
- Efficient causal MHA
- Does not store the attention weights.
- Does not compute the key/query scores that are masked out.
- Activation checkpointing
- Reduces the amount of activations recomputed during the backward pass.
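A minimal sketch of RMSNorm-based pre-normalization as noted above (PyTorch; the eps value and the residual wiring in the comment are illustrative, not LLaMA's exact implementation):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: normalize by the RMS of the features
    (no mean subtraction), then apply a learnable per-feature gain."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

# Pre-normalization: normalize the *input* of each sub-layer instead of its output.
# h = x + Attention(RMSNorm(x));  out = h + FFN(RMSNorm(h))
```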
FP8 Formats for Deep Learning
General
Requires a wide exponent range to handle overflows/underflows.
Requires scaling.
Format
- E4M3 – deviates from IEEE-754 special-value conventions to extend dynamic range
- Forward pass: weights and activations.
- Does not represent infinities.
- Represents positive/negative zero and NaN.
- Keeps the binary encoding symmetric.
- E5M2 – follows IEEE-754 conventions for special values
- Backward pass: gradients.
- Represents infinities, positive/negative zero, and NaN.
- Exponent bias
- Handled by software-level shifting (scaling) rather than a different hardware exponent bias.
- Per-tensor scaling factors (see the scaling sketch after this list).
- Subnormal numbers:
- Exponent field of all zeros with no implicit leading 1 in the mantissa (non-zero significand field).
- Exist to allow gradual underflow, avoiding cases where subtracting two nearby floats yields exactly zero and triggers divide-by-zero exceptions.
- The NVIDIA H100 whitepaper does not state whether FP8 subnormal numbers are handled.
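A small NumPy sketch of per-tensor scaling into the FP8 dynamic range (448 and 57344 are the largest finite magnitudes of E4M3 and E5M2; the cast itself is only simulated by clipping, mantissa rounding is not modeled):

```python
import numpy as np

# Largest finite magnitudes of the two FP8 formats.
FP8_MAX = {"e4m3": 448.0, "e5m2": 57344.0}

def quantize_per_tensor(x: np.ndarray, fmt: str = "e4m3"):
    """Scale a tensor so its max magnitude lands at the top of the FP8 range,
    then simulate the FP8 cast by clipping (mantissa rounding not modeled)."""
    amax = float(np.abs(x).max())
    scale = FP8_MAX[fmt] / max(amax, 1e-12)      # per-tensor scaling factor
    x_scaled = np.clip(x * scale, -FP8_MAX[fmt], FP8_MAX[fmt])
    return x_scaled, scale

def dequantize(x_scaled: np.ndarray, scale: float) -> np.ndarray:
    return x_scaled / scale

grads = (np.random.randn(4, 4) * 1e-3).astype(np.float32)
q, s = quantize_per_tensor(grads, "e4m3")
print(np.allclose(dequantize(q, s), grads))       # True (rounding error not simulated)
```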
FP8-LM: Training FP8 Large Language Models
Insights
- Trains with low-precision data formats without compromising model accuracy and without changing hyper-parameters.
- Applied to both pre-training and supervised fine-tuning.
Key Idea
Progressively apply FP8 to:
- Gradients (collective communication)
- Optimizer states
- Distributed parallel training
Innovations
- Precision decoupling
- Automatic scaling
FP8 LLMs
FP8 Gradient and All-Reduce Communication
Gradient
- Pre-scaling: divide each gradient by the number of GPUs before aggregation (risks FP8 underflow).
- Post-scaling: divide by the number of GPUs after aggregation (risks overflow during the sum).
- Auto-scaling: adjust a per-tensor scaling factor by 0.5x or 2x depending on how the maximum gradient value compares to a threshold (sketched after this list).
All-Reduce communication
- Reduce the per-GPU scaling factors to a single global minimum scaling factor s and use it in the reduction; the aggregated gradient is recovered as g/s.
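A single-process sketch of the auto-scaling rule and scaled all-reduce described above (function names and the 2-GPU toy setup are made up for illustration):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # overflow threshold for the scaled gradients

def auto_scale(scale: float, scaled_amax: float) -> float:
    """Adjust a scaling factor by 0.5x / 2x based on the scaled max value
    versus the FP8 threshold (the rule described in the notes)."""
    if scaled_amax >= FP8_E4M3_MAX:
        return scale * 0.5
    if scaled_amax < FP8_E4M3_MAX / 2:
        return scale * 2.0
    return scale

print(auto_scale(1024.0, scaled_amax=500.0))   # 512.0: halved to avoid overflow

def fp8_all_reduce(scaled_grads, scales, n_gpus):
    """Single-process stand-in for the FP8 all-reduce:
    1) reduce the per-GPU scaling factors to the global minimum s,
    2) re-express every scaled gradient in terms of that common s,
    3) sum, then divide by s (dequantize, i.e. g/s) and by n_gpus (post-scaling)."""
    s = min(scales)
    rescaled = [g * (s / si) for g, si in zip(scaled_grads, scales)]
    total = np.sum(rescaled, axis=0)
    return total / (s * n_gpus)

# Toy example with two "GPUs".
g0, g1 = np.array([1e-3, 2e-3]), np.array([3e-3, 4e-3])
s0 = FP8_E4M3_MAX / np.abs(g0).max()
s1 = FP8_E4M3_MAX / np.abs(g1).max()
print(fp8_all_reduce([g0 * s0, g1 * s1], [s0, s1], n_gpus=2))  # ≈ [0.002, 0.003]
```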
FP8 Optimizer
The baseline mixed-precision Adam setup needs 16 bytes per parameter:
- 4 (master weights) + 4 (gradients) + (4 + 4) (Adam states) = 16 bytes.
Precision decoupling: The gradient statistics can use lower precision but the master weights require high precision.
- The first-order gradient moment can tolerate a high quantization error.
- The second-order gradient moment cannot:
- squaring small gradient values produces tiny numbers that underflow in FP8.
Master weights are stored in BF16 with dynamic scaling.
This paper's recipe needs 6 bytes per parameter:
- 2 (master weights) + 1 (gradients) + (1 + 2) (Adam states) = 6 bytes (back-of-the-envelope check below).
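A quick back-of-the-envelope check of the per-parameter memory (byte counts from the notes above; the 175B parameter count is only an illustrative model size):

```python
# Per-parameter optimizer-related memory, baseline mixed-precision Adam vs. the FP8-LM layout.
BASELINE_BYTES = 4 + 4 + 4 + 4   # master weights + gradients + Adam m, v
FP8LM_BYTES    = 2 + 1 + 1 + 2   # 2-byte master weights + FP8 grads + FP8 m + 2-byte v

params = 175e9                    # illustrative GPT-175B-sized model
print(params * BASELINE_BYTES / 2**30, "GiB")  # ≈ 2608 GiB
print(params * FP8LM_BYTES    / 2**30, "GiB")  # ≈  978 GiB
```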
FP8 Distributed Parallel Training
- Data parallelism and pipeline parallelism are unaffected (no changes needed).
- Tensor Parallelism
- Distributes each tensor in its entirety across devices while taking the per-tensor scaling factor into account.
- Open question from these notes: is this scalable for very large models and tensors?
Evaluation
Evaluated on GPT-style models trained on 40B tokens.
Sequence Parallelism: Long Sequence Training from System Perspective
Insights
Innovations
- Splits long sequences into multiple chunks and feeds them to different devices.
- Ring Self-Attention (RSA): circulates key and value embeddings among devices in a ring manner.
- Composable with data, tensor, and pipeline parallelism to form 4D parallelism.
- Also applicable to shorter-sequence modeling and sparse attention.
Sequence Parallelism
- Limited to bidirectional self-attention.
- Design (toy ring simulation at the end of this section)
- 2-stage:
- Stage 1: transmit key embeddings among devices to calculate attention scores.
- Performed N − 1 times so the local query slice can compute QK^T against all local/remote key slices.
- The final result is a slice of QK^T along the sequence_q dimension.
- Stage 2: transmit value embeddings among devices to calculate the attention output.
- Performed N − 1 times so the local score slice can compute SV against all local/remote value slices.
- The final result is a slice of SV along the sequence_q dimension.
- Modeling
- Memory usage
- For both MLP and MHA, it is more memory-efficient than tensor parallelism if B·L (batch size × sequence length) is large enough.
- Communication cost
- Sequence parallelism
- FWD: 2 ring-style P2P communication
- BWD: 2 all-reduce collective communication and 2 ring-style P2P communication
- Tensor parallelism
- FWD/BWD: 4 collective communication
- Sequence parallelism has the same communication overhead as tensor parallelism, but better compatibility with pipeline parallelism since it does not need an all-gather across pipeline stages.
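A toy single-process simulation of the two-stage RSA pattern above (plain loops stand in for the N − 1 ring exchanges of K and V; the 1/sqrt(d) scaling is omitted on both sides of the check):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ring_self_attention(Q_chunks, K_chunks, V_chunks):
    """Device i owns Q_i, K_i, V_i (slices along the sequence dimension).
    Stage 1: K_j slices arrive over N-1 ring steps; device i builds its
             row block of scores softmax(Q_i K^T).
    Stage 2: V_j slices arrive over N-1 ring steps; device i accumulates
             the matching column block of its scores times V_j."""
    n = len(Q_chunks)
    outputs = []
    for i in range(n):                                   # "device" i
        # Stage 1 (a plain loop stands in for the ring exchange of K).
        scores = np.concatenate(
            [Q_chunks[i] @ K_chunks[j].T for j in range(n)], axis=-1)
        S_i = softmax(scores)                            # slice of softmax(QK^T) along seq_q
        # Stage 2 (ring exchange of V): block-wise S_i @ V.
        chunk = K_chunks[0].shape[0]
        out_i = sum(S_i[:, j * chunk:(j + 1) * chunk] @ V_chunks[j] for j in range(n))
        outputs.append(out_i)
    return np.concatenate(outputs, axis=0)               # full attention output

# Sanity check against single-device attention on a tiny example.
L, d, n = 8, 4, 2
Q, K, V = (np.random.randn(L, d) for _ in range(3))
ref = softmax(Q @ K.T) @ V
rsa = ring_self_attention(np.split(Q, n), np.split(K, n), np.split(V, n))
print(np.allclose(ref, rsa))  # True
```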
Rethinking Attention With Performers
- Computes the product of (kernelized) K and V first instead of forming QK^T, which avoids the L × L attention matrix and gives linear complexity in sequence length.
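A minimal sketch of that associativity trick, with a very simplified positive random-feature map standing in for the paper's FAVOR+ mechanism (the feature dimension r and the scaling constants are illustrative):

```python
import numpy as np

def random_features(X, W):
    """Simplified positive feature map phi(x) ≈ exp(Wx − |x|^2 / 2);
    normalization constants are omitted (they cancel in num/den below)."""
    return np.exp(X @ W.T - 0.5 * np.square(X).sum(-1, keepdims=True))

L, d, r = 1024, 64, 128
Q = np.random.randn(L, d) / d**0.25   # split the usual 1/sqrt(d) between Q and K
K = np.random.randn(L, d) / d**0.25
V = np.random.randn(L, d)
W = np.random.randn(r, d)             # random projection defining the feature map

phi_Q, phi_K = random_features(Q, W), random_features(K, W)   # (L, r)

# Linear attention: compute phi(K)^T V first -> (r, d), then phi(Q) @ that.
num = phi_Q @ (phi_K.T @ V)                        # O(L * r * d), no L x L matrix
den = phi_Q @ phi_K.sum(axis=0, keepdims=True).T   # row-wise normalizer
approx = num / den
print(approx.shape)  # (1024, 64)
```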