
Reading Notes – 3

Written 03/11/2024-03/17/2024

GPT-3: Language Models are Few-Shot Learners

Insights

  • In-context (few-shot) learning with a sufficiently large pre-trained model delivers strong performance on many tasks without any gradient updates or fine-tuning (see the sketch below).
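
A minimal sketch of what few-shot in-context learning looks like at the prompt level: worked demonstrations of the task are packed into the context window and the frozen model simply continues the pattern, with no gradient updates. The task, the demonstrations, and the query_model placeholder below are illustrative, not from the paper.

    def build_few_shot_prompt(instruction, demos, query):
        """Pack an instruction, K worked demonstrations, and a new query into
        one prompt; the frozen model is expected to continue the pattern."""
        lines = [instruction, ""]
        for x, y in demos:
            lines += [f"Input: {x}", f"Output: {y}", ""]
        lines += [f"Input: {query}", "Output:"]
        return "\n".join(lines)

    demos = [("cheese", "fromage"), ("bread", "pain")]      # hypothetical 2-shot task
    prompt = build_few_shot_prompt("Translate English to French.", demos, "water")
    print(prompt)
    # completion = query_model(prompt)   # placeholder for any text-completion endpoint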

Observations

  • Collecting a large labeled dataset for every new task is impractical.
  • Pre-trained models fine-tuned on narrow task distributions can exploit spurious correlations and generalize poorly out of distribution.
  • Humans don't require large supervised datasets to learn a new task.

Training

  • Data: Common Crawl, filtered for similarity to high-quality reference corpora, fuzzy-deduplicated, and mixed with those high-quality reference corpora.
  • Large batch size and small learning rate.
    • The gradient noise scale is used to guide the choice of batch size (see the first sketch after this list).
  • Model parallelism both within each matrix multiply (GEMM) and across the layers of the network (see the second sketch after this list).
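
The gradient noise scale bullet refers to the heuristic of McCandlish et al. ("An Empirical Model of Large-Batch Training"), which the GPT-3 paper cites to guide batch-size selection. Below is a minimal sketch of the simple estimator B_simple = tr(Sigma) / |G|^2; in practice it is estimated from gradients at two different batch sizes, while here it is computed naively from toy per-example gradients, and the function name is illustrative.

    import numpy as np

    def simple_gradient_noise_scale(per_example_grads):
        """B_simple = tr(Sigma) / |G|^2, where G is the mean gradient and Sigma
        the per-example gradient covariance. A large value suggests training can
        still benefit from a larger batch size; a small one suggests diminishing
        returns from scaling the batch further."""
        g = per_example_grads.mean(axis=0)              # estimate of the true gradient G
        var = per_example_grads.var(axis=0, ddof=1)     # diagonal of Sigma
        return var.sum() / (g @ g)

    rng = np.random.default_rng(0)
    grads = rng.normal(loc=0.1, scale=1.0, size=(256, 1000))   # 256 toy per-example gradients
    print(f"B_simple ~= {simple_gradient_noise_scale(grads):.1f}")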
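Model parallelism "within each matrix multiply" amounts to sharding a weight matrix across devices. A minimal NumPy sketch of a column-parallel GEMM follows; the "devices" are just list entries and the shapes are illustrative.

    import numpy as np

    def column_parallel_matmul(x, weight_shards):
        """Each 'device' holds one column shard of W and computes its slice of
        X @ W; concatenating the partial outputs recovers the full result."""
        partials = [x @ w for w in weight_shards]   # would run on separate devices in practice
        return np.concatenate(partials, axis=-1)

    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 512))                   # (batch, d_model) activations
    w = rng.normal(size=(512, 2048))                # full weight matrix of one layer
    shards = np.split(w, 4, axis=1)                 # 4-way column sharding, one shard per device
    assert np.allclose(column_parallel_matmul(x, shards), x @ w)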

Architecture

  • Same architecture as GPT-2, including the modified initialization, pre-normalization, and reversible tokenization.
  • Except that GPT-3 alternates dense and locally banded sparse attention patterns across the layers of the transformer, similar to the Sparse Transformer (see the sketch below).
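
A minimal NumPy sketch contrasting the two patterns: a dense causal mask versus one simple form of a locally banded mask, in which each token attends only to a fixed window of recent positions. The window size is illustrative, not the paper's setting.

    import numpy as np

    def causal_mask(seq_len):
        """Dense causal attention: token i may attend to every token j <= i."""
        return np.tril(np.ones((seq_len, seq_len), dtype=bool))

    def local_banded_mask(seq_len, window):
        """One simple locally banded pattern: token i attends only to the most
        recent `window` tokens, i.e. j in [i - window + 1, i]."""
        i = np.arange(seq_len)[:, None]
        j = np.arange(seq_len)[None, :]
        return (j <= i) & (j > i - window)

    print(causal_mask(6).astype(int))                    # full lower triangle
    print(local_banded_mask(6, window=3).astype(int))    # narrow band near the diagonal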

Blockwise Parallel Transformer

Innovation

  • Split the feed-forward layers into blocks as well, so both attention and the FFN are computed block by block along the sequence (see the sketch after this list).
    • Attention: blockwise with a running softmax, as in the FlashAttention algorithm.
    • FFN: tiled along the query (sequence) dimension and fused into the same blockwise loop.
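
A minimal NumPy sketch of the blockwise idea, not the paper's exact kernel: queries are processed one block at a time, attention over key/value blocks is accumulated with a running softmax, and the feed-forward network is applied to the finished query block before moving on, so no full (seq x seq) score matrix is ever materialized. Shapes, block size, and the toy two-layer ReLU FFN are illustrative.

    import numpy as np

    def blockwise_attention_ffn(q, k, v, w1, w2, block=128):
        """Process queries block by block: fold each key/value block into a
        running-softmax accumulator, then apply the FFN to the finished query
        block before moving on."""
        seq, d = q.shape
        out = np.empty_like(q)
        for qs in range(0, seq, block):
            qb = q[qs:qs + block]                               # (bq, d) query block
            acc = np.zeros_like(qb)                             # unnormalized attention output
            m = np.full(qb.shape[0], -np.inf)                   # running row-wise max
            s = np.zeros(qb.shape[0])                           # running softmax denominator
            for ks in range(0, seq, block):
                scores = qb @ k[ks:ks + block].T / np.sqrt(d)   # (bq, bk) block of scores
                m_new = np.maximum(m, scores.max(axis=1))
                scale = np.exp(m - m_new)                       # rescale previous accumulator
                p = np.exp(scores - m_new[:, None])
                acc = acc * scale[:, None] + p @ v[ks:ks + block]
                s = s * scale + p.sum(axis=1)
                m = m_new
            attn = acc / s[:, None]
            out[qs:qs + block] = np.maximum(attn @ w1, 0.0) @ w2   # ReLU FFN on this block only
        return out

    # Check against a plain (non-blockwise) reference on random inputs.
    rng = np.random.default_rng(0)
    seq, d, d_ff = 512, 64, 256
    q, k, v = rng.normal(size=(3, seq, d))
    w1, w2 = rng.normal(size=(d, d_ff)), rng.normal(size=(d_ff, d))
    scores = q @ k.T / np.sqrt(d)
    p = np.exp(scores - scores.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    ref = np.maximum((p @ v) @ w1, 0.0) @ w2
    assert np.allclose(blockwise_attention_ffn(q, k, v, w1, w2), ref)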

Ring Attention

Innovation

  • Key/value blocks are passed around a ring of hosts, with the ring-based communication of the next block overlapped with blockwise attention computation on the current one (see the sketch below).
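
A minimal sketch of the ring schedule, with the ring simulated in-process rather than with real device collectives: each host keeps its query block resident, key/value blocks rotate one hop per step, and each host folds the newly arrived block into a running-softmax accumulator. In the real system the transfer of the next KV block overlaps with this computation; host count and shapes here are illustrative.

    import numpy as np

    def ring_attention(q_blocks, k_blocks, v_blocks):
        """Each 'host' i keeps q_blocks[i] resident and starts with
        (k_blocks[i], v_blocks[i]). KV blocks rotate one hop around the ring
        per step, and each host folds the newly arrived block into its
        running-softmax accumulator."""
        n = len(q_blocks)
        d = q_blocks[0].shape[-1]
        acc = [np.zeros_like(q) for q in q_blocks]
        m = [np.full(len(q), -np.inf) for q in q_blocks]
        s = [np.zeros(len(q)) for q in q_blocks]
        kv = list(zip(k_blocks, v_blocks))          # KV block currently resident on each host
        for _ in range(n):                          # n hops: every host sees every KV block once
            for i in range(n):                      # the "hosts" run in parallel in practice
                k, v = kv[i]
                scores = q_blocks[i] @ k.T / np.sqrt(d)
                m_new = np.maximum(m[i], scores.max(axis=1))
                scale = np.exp(m[i] - m_new)
                p = np.exp(scores - m_new[:, None])
                acc[i] = acc[i] * scale[:, None] + p @ v
                s[i] = s[i] * scale + p.sum(axis=1)
                m[i] = m_new
            kv = kv[-1:] + kv[:-1]                  # rotate KV blocks one step around the ring
        return [a / z[:, None] for a, z in zip(acc, s)]

    # Check against plain full attention over the concatenated sequence.
    rng = np.random.default_rng(0)
    hosts, block, d = 4, 32, 16
    q, k, v = rng.normal(size=(3, hosts, block, d))
    out = np.concatenate(ring_attention(list(q), list(k), list(v)))
    scores = q.reshape(-1, d) @ k.reshape(-1, d).T / np.sqrt(d)
    p = np.exp(scores - scores.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    assert np.allclose(out, p @ v.reshape(-1, d))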
