
Reading Notes – 2

03/04/2024 – 03/10/2024

LLaMA: Open and Efficient Foundation Language Models

Insights

Key Idea

  • Training smaller models on more tokens costs more training compute, but yields a cheaper and faster model at inference time.

Training

  • Data: open-source data only, including CommonCrawl, GitHub, Wikipedia, Books, arXiv, StackExchange.
  • Tokenizer: byte-pair encoding (BPE). 1.4T training tokens.
  • Optimizer: AdamW with a cosine learning-rate schedule.

Architecture

  • Standard transformer with the following modifications (see the sketch after this list):
    • Pre-normalization with RMSNorm. (GPT-3)
    • SwiGLU activation. (PaLM)
    • Rotary embeddings (RoPE). (GPTNeo)
  • Efficient implementation
    • Efficient MHA
      • Does not store the attention weights.
      • Does not compute the key/query scores that are masked.
  • Reduces the amount of activations recomputed during the backward pass via checkpointing.
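A minimal sketch of two of the listed modifications (RMSNorm pre-normalization and a SwiGLU feed-forward), assuming PyTorch; the dimensions and hidden size are illustrative, not the paper's actual configuration.

```python
# Sketch of two LLaMA-style modifications: pre-norm RMSNorm and a SwiGLU FFN.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Pre-normalization: scale by the root-mean-square, no mean subtraction."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLUFFN(nn.Module):
    """Feed-forward with SwiGLU activation: silu(x W1) * (x W3), then W2."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)
        self.w3 = nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

x = torch.randn(2, 16, 512)                 # (batch, seq, dim) - illustrative sizes
y = SwiGLUFFN(512, 1376)(RMSNorm(512)(x))   # pre-norm, then the FFN
print(y.shape)                              # torch.Size([2, 16, 512])
```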

FP8 Formats for Deep Learning

General

Requires a wide exponent range to handle overflows/underflows.

Requires scaling.

Format

  • E4M3
    • Forward pass: weights and activations.
    • Deviates from IEEE-754 conventions: does not represent infinities, which frees encodings and extends the dynamic range.
    • Represents NaNs and positive/negative zeros.
      • Keeps the encoding symmetric.
  • E5M2
    • Backward pass: gradients (which need the wider dynamic range).
    • Follows IEEE-754 conventions: represents infinities, NaNs, and positive/negative zeros.
  • Exponent bias
    • Instead of varying the bias in hardware, shift the representable range in software with scaling factors.
  • Per-tensor scaling
  • Subnormal numbers:
    • Exponent field is all zeros and there is no implicit leading 1 in the mantissa (non-zero significand with the minimum exponent).
    • Provide gradual underflow near zero, e.g. so that subtracting two distinct nearby floats does not produce exactly zero and trigger a later divide-by-zero.
  • The NVIDIA H100 whitepaper does not state whether the FP8 hardware handles subnormal numbers.
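A small sketch of the per-tensor scaling idea. The max-normal values 448 (E4M3) and 57344 (E5M2) come from the FP8 paper; the cast itself is only simulated with a clamp (mantissa rounding omitted), and `fake_fp8_cast` is an illustrative helper, not a library API.

```python
# Per-tensor scaling into an FP8 format's dynamic range (range handling only).
import torch

FP8_MAX = {"e4m3": 448.0, "e5m2": 57344.0}  # max normal values per format

def per_tensor_scale(t: torch.Tensor, fmt: str = "e4m3") -> torch.Tensor:
    """Scale factor mapping the tensor's absolute maximum onto the format's max."""
    amax = t.abs().max().clamp(min=1e-12)
    return FP8_MAX[fmt] / amax

def fake_fp8_cast(t: torch.Tensor, fmt: str = "e4m3"):
    """Simulated cast: scale, clamp to the representable range, return (scaled, scale)."""
    s = per_tensor_scale(t, fmt)
    scaled = (t * s).clamp(-FP8_MAX[fmt], FP8_MAX[fmt])
    return scaled, s            # dequantize later with scaled / s

grad = torch.randn(1024) * 1e-4        # small gradients would underflow without scaling
q, s = fake_fp8_cast(grad, "e5m2")     # E5M2 is the backward/gradient format
restored = q / s
```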

FP8-LM: Training FP8 Large Language Models

Insights

  • Trains LLMs with low-precision (FP8) data formats without compromising model accuracy and without changing hyper-parameters.
  • Applied both in pre-training and supervised fine-tuning.

Key Idea

Progressively applies FP8 to three levels of the training stack:

  • Gradients (collective communication)
  • Optimizer states
  • Distributed parallel training

Innovations

  • Precision decoupling
  • Automatic scaling

FP8 LLMs

FP8 Gradient and All-Reduce Communication

Gradient

  • Pre-scaling: divide by the number of GPUs before aggregation (risks FP8 underflow for small gradients).
  • Post-scaling: divide by the number of GPUs after aggregation (risks overflow during the summation).
  • Auto-scaling: keep a scaling factor that is adjusted by 0.5x or 2x depending on how the maximum gradient value compares to a threshold of the FP8 range (see the sketch below).
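A hedged sketch of the auto-scaling rule above. The headroom threshold and the use of the E4M3 maximum are assumptions for illustration; the paper's exact rule and format choice may differ.

```python
# Auto-scaling: nudge a per-tensor scale mu by 2x / 0.5x so scaled gradients
# stay inside the FP8 range without overflowing.
import torch

FP8_MAX = 448.0  # E4M3 max normal; illustrative choice

def auto_scale(grad: torch.Tensor, mu: float, headroom: float = 0.5) -> float:
    """Return an updated scaling factor mu for the next step."""
    amax = (grad * mu).abs().max().item()
    if amax > FP8_MAX:              # scaled gradient would overflow: back off
        return mu * 0.5
    if amax < headroom * FP8_MAX:   # plenty of headroom: use more of the range
        return mu * 2.0
    return mu

mu = 1.0
for _ in range(5):
    g = torch.randn(4096) * 1e-3
    mu = auto_scale(g, mu)
    g_fp8 = (g * mu).clamp(-FP8_MAX, FP8_MAX)  # the value that would be communicated
```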

All-Reduce communication

  • All-reduce the per-tensor scaling factors first to obtain the global minimum scaling factor, then use that single shared scale for the gradient reduction; dequantization is g/s (see the sketch below).
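A sketch of this all-reduce recipe using torch.distributed. The FP8 cast is again simulated with a clamp, and the scale convention (quantized = g × s, dequantized = g / s) is an assumption made for illustration.

```python
# Gradient all-reduce with a globally shared (minimum) scaling factor.
import os
import torch
import torch.distributed as dist

FP8_MAX = 448.0  # illustrative format choice

def fp8_all_reduce(grad: torch.Tensor, local_scale: torch.Tensor) -> torch.Tensor:
    # 1. Agree on one scaling factor: the global minimum across ranks.
    shared_scale = local_scale.clone()
    dist.all_reduce(shared_scale, op=dist.ReduceOp.MIN)

    # 2. Quantize the local gradient with the shared scale (simulated cast).
    g_fp8 = (grad * shared_scale).clamp(-FP8_MAX, FP8_MAX)

    # 3. Sum across ranks and average (post-scaling by the number of GPUs).
    dist.all_reduce(g_fp8, op=dist.ReduceOp.SUM)
    g_fp8 /= dist.get_world_size()

    # 4. Dequantize: g / s recovers the averaged gradient.
    return g_fp8 / shared_scale

if __name__ == "__main__":
    # Single-process demo with the gloo backend so the sketch is runnable as-is.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)
    print(fp8_all_reduce(torch.randn(8) * 1e-3, torch.tensor([64.0])))
```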

FP8 Optimizer

The original mixed-precision recipe keeps 16 bytes of Adam-related state per parameter:

  • 4 (FP32 master weights) + 4 (FP32 gradients) + (4 + 4) (FP32 Adam first/second moments) = 16 bytes.

Precision decoupling: the gradient statistics can use lower precision, but the master weights require high precision.

  • The first-order gradient moment tolerates a high quantization error.
  • The second-order gradient moment does not.
    • Its values (squares of small gradients) can underflow in FP8.

Master weights are stored in FP16 with dynamic (per-tensor) scaling.

This paper's FP8 optimizer needs 6 bytes per parameter (see the arithmetic sketch below):

  • 2 (FP16 master weights) + 1 (FP8 gradients) + (1 + 2) (FP8 first moment + FP16 second moment) = 6 bytes.
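A back-of-the-envelope check of the two byte counts above. The 175B parameter count is only an illustrative model size, not a result from the paper.

```python
# Per-parameter optimizer memory for the two recipes quoted above.
def optimizer_memory_gb(n_params: float, bytes_per_param: int) -> float:
    return n_params * bytes_per_param / 1e9

n = 175e9                              # hypothetical 175B-parameter model
print(optimizer_memory_gb(n, 16))      # mixed-precision Adam: 4+4+4+4 bytes -> 2800.0 (GB)
print(optimizer_memory_gb(n, 6))       # FP8-LM recipe:        2+1+1+2 bytes -> 1050.0 (GB)
```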

FP8 Distributed Parallel Training

  • Data parallelism and pipeline parallelism are unaffected (no FP8-specific changes needed).
  • Tensor parallelism
    • Shards each tensor across devices while carrying the per-tensor scaling factor (which applies to the tensor as a whole) along with the FP8 shards.
      • Open question from the notes: does this stay scalable for very large models and tensors?

Evaluation

Evaluated on GPT-style models trained with 40B tokens.

Sequence Parallelism: Long Sequence Training from System Perspective

Insights

Innovations

  • Splits long sequences into multiple chunks and feeds them to different devices.
    • Ring Self-Attention (RSA): circulates key and value embeddings among devices in a ring.
  • Composable with data, tensor, and pipeline parallelism to form 4D parallelism.
  • Also applicable to shorter-sequence modeling and compatible with sparse attention.

Sequence Parallelism

  • Limited to bidirectional self-attention.
  • Design
    • Two stages (simulated in the sketch after this section).
    • Stage 1: transmit key embeddings among devices to calculate the attention scores.
      • Performed N - 1 times so the local query slice can compute QK^T against every local and remote key slice.
      • The final output is a slice of QK^T along the sequence_q dimension.
    • Stage 2: transmit value embeddings among devices to calculate the output of the attention layer.
      • Performed N - 1 times so the local score slice can compute SV against every local and remote value slice.
      • The final output is a slice of SV along the sequence_q dimension.
  • Modeling
    • Memory usage
      • For both the MLP and MHA blocks, sequence parallelism is more memory-efficient than tensor parallelism when BL (batch size × sequence length) is large enough.
    • Communication cost
      • Sequence parallelism
        • FWD: 2 ring-style P2P communications
        • BWD: 2 all-reduce collectives and 2 ring-style P2P communications
      • Tensor parallelism
        • FWD/BWD: 4 collective communications
      • Sequence parallelism has roughly the same communication overhead as tensor parallelism, but better compatibility with pipeline parallelism since it does not need an all-gather across pipeline stages.
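A single-process simulation of the two RSA stages above: Q, K, V are split into N equal chunks ("devices"), and the K chunks, then the V chunks, are walked around the ring so each device sees every chunk exactly once. This is a sketch for intuition (no real communication, sequence length assumed divisible by N), not a distributed implementation.

```python
# Ring Self-Attention simulated on one process; result matches full attention.
import torch

def ring_self_attention(q, k, v, n_devices: int):
    L, d = q.shape
    qs = list(q.chunk(n_devices))      # local query slices (stay put)
    ks = list(k.chunk(n_devices))      # key slices, circulated in stage 1
    vs = list(v.chunk(n_devices))      # value slices, circulated in stage 2
    c = ks[0].shape[0]                 # chunk size (assumes L % n_devices == 0)

    # Stage 1: each device fills in scores of its Q slice against every K slice.
    scores = [torch.empty(c, L) for _ in range(n_devices)]
    for step in range(n_devices):      # step 0 = local K, then N - 1 ring hops
        for i in range(n_devices):
            src = (i - step) % n_devices           # whose K slice is resident now
            cols = slice(src * c, src * c + c)
            scores[i][:, cols] = qs[i] @ ks[src].T / d ** 0.5
    probs = [torch.softmax(s, dim=-1) for s in scores]

    # Stage 2: circulate V slices and accumulate the attention output SV.
    outs = [torch.zeros(c, d) for _ in range(n_devices)]
    for step in range(n_devices):
        for i in range(n_devices):
            src = (i - step) % n_devices
            cols = slice(src * c, src * c + c)
            outs[i] += probs[i][:, cols] @ vs[src]

    return torch.cat(outs)             # full output, sliced along sequence_q

q = k = v = torch.randn(32, 16)
ref = torch.softmax(q @ k.T / 16 ** 0.5, dim=-1) @ v
assert torch.allclose(ring_self_attention(q, k, v, 4), ref, atol=1e-5)
```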

Rethinking Attention with Performers

  • Computes the product K^T V first instead of QK^T, making attention linear in sequence length (see the sketch below).
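A sketch of that reordering: with a feature map phi, attention becomes phi(Q) (phi(K)^T V), which costs O(L·m·d) instead of O(L^2·d). The random-feature map used here only gestures at the FAVOR+ idea and is not the exact Performer kernel.

```python
# (K^T V)-first linear attention with an illustrative random-feature map.
import torch

L, d, m = 1024, 64, 128                     # sequence length, head dim, feature dim
q, k, v = torch.randn(3, L, d).unbind(0)
proj = torch.randn(d, m) / d ** 0.25        # random projection (illustrative)

phi_q = torch.exp(q @ proj - q.pow(2).sum(-1, keepdim=True) / 2)
phi_k = torch.exp(k @ proj - k.pow(2).sum(-1, keepdim=True) / 2)

kv = phi_k.T @ v                            # (m, d): no L x L score matrix is formed
norm = phi_q @ phi_k.sum(0, keepdim=True).T # (L, 1) row normalizer
out = (phi_q @ kv) / norm                   # (L, d) attention output
```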
