03/11/2024-03/17/2024
GPT-3: Language Models are Few-Shot Learners
Insights
- In-context (few-shot) learning with a sufficiently large pre-trained model delivers strong task performance without any fine-tuning, as sketched below.
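A minimal sketch of what few-shot in-context learning looks like at inference time: the task is described entirely in the prompt together with a few demonstrations, and the model is asked to complete the next example with no gradient updates. The `generate` call is a hypothetical stand-in for any large pre-trained language model.

```python
# Few-shot prompt in the style of the paper's English->French examples.
demonstrations = [
    ("sea otter", "loutre de mer"),
    ("cheese", "fromage"),
    ("house", "maison"),
]
prompt = "Translate English to French:\n"
prompt += "\n".join(f"{en} => {fr}" for en, fr in demonstrations)
prompt += "\nbook =>"  # the model should continue with "livre"

# generate(prompt)  # hypothetical call into a pre-trained LM; no weight updates
print(prompt)
```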
Observations
- Training a separate model per task requires a large dedicated labeled dataset, which is impractical to collect for every new task.
- Fine-tuned models can exploit spurious correlations in narrow training distributions and therefore generalize poorly out of distribution.
- Humans don’t require large supervised datasets to learn.
Training
- Data: Common Crawl, cleaned with quality filtering and fuzzy deduplication, mixed with high-quality reference corpora.
- Larger models use a larger batch size but a smaller learning rate.
- The gradient noise scale is measured during training to guide the choice of batch size (see the sketch after this list).
- Model parallelism both within each matrix multiply and across the layers of the network.
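A rough sketch of the "simple" gradient noise scale from McCandlish et al. (the batch-size heuristic GPT-3 cites), estimated here from per-example gradients. The paper does not spell out the exact estimator used during training, so this formulation is an assumption for illustration.

```python
import numpy as np

def simple_gradient_noise_scale(per_example_grads: np.ndarray) -> float:
    """Estimate B_simple = tr(Sigma) / |G|^2 from a matrix of
    per-example gradients with shape (batch_size, num_params)."""
    mean_grad = per_example_grads.mean(axis=0)                 # G: estimate of the true gradient
    trace_sigma = per_example_grads.var(axis=0, ddof=1).sum()  # tr(Sigma): per-example gradient variance
    return float(trace_sigma / (np.dot(mean_grad, mean_grad) + 1e-12))

# Toy usage: noisy gradients around a shared direction -> finite noise scale.
rng = np.random.default_rng(0)
grads = rng.normal(loc=1.0, scale=0.5, size=(64, 1000))
print(simple_gradient_noise_scale(grads))
```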
Architecture
- Same as GPT-2: modified initialization, pre-normalization, and reversible tokenization.
- The exception: alternating dense and locally banded sparse attention patterns across the transformer layers, similar to the Sparse Transformer (mask sketch below).
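A small sketch of what the alternation could look like: even layers use the ordinary dense causal mask, odd layers a locally banded causal mask. The exact schedule and band width in GPT-3 are not published, so `layer_mask` and `window` here are illustrative assumptions.

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Dense causal mask: position i attends to every j <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def banded_causal_mask(seq_len: int, window: int) -> np.ndarray:
    """Locally banded causal mask: position i attends only to the last `window` positions."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

def layer_mask(layer_idx: int, seq_len: int, window: int = 256) -> np.ndarray:
    # Assumed schedule: alternate dense and locally banded patterns layer by layer.
    return causal_mask(seq_len) if layer_idx % 2 == 0 else banded_causal_mask(seq_len, window)

print(layer_mask(0, 8).astype(int))            # dense causal
print(layer_mask(1, 8, window=3).astype(int))  # locally banded causal
```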
Blockwise Parallel Transformer
Innovation
- Split the feed-forward computation into blocks and fuse it with blockwise attention, so full-sequence activations are never materialized.
- Attention: blockwise computation in the style of FlashAttention, accumulating an online softmax over key/value blocks.
- FFN: tiling along the query (sequence) dimension; see the sketch after this list.
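A NumPy sketch of the blockwise pattern: attention is accumulated block by block with an online softmax, and the feed-forward network is applied per query block, so the full seq_len x seq_len score matrix is never materialized. The shapes, the two-layer ReLU FFN (`w1`, `w2`), and the omission of causal masking are simplifications for illustration.

```python
import numpy as np

def blockwise_attention_ffn(q, k, v, w1, w2, block_q=128, block_kv=128):
    """Blockwise attention (online softmax) followed by a per-block FFN.
    Causal masking is omitted for brevity."""
    seq_len, d = q.shape
    out = np.empty_like(q)
    scale = 1.0 / np.sqrt(d)
    for qs in range(0, seq_len, block_q):
        qb = q[qs:qs + block_q]                   # one query block
        m = np.full(qb.shape[0], -np.inf)         # running max of scores
        l = np.zeros(qb.shape[0])                 # running softmax denominator
        acc = np.zeros_like(qb)                   # running weighted sum of values
        for ks in range(0, seq_len, block_kv):
            s = qb @ k[ks:ks + block_kv].T * scale
            m_new = np.maximum(m, s.max(axis=-1))
            p = np.exp(s - m_new[:, None])
            corr = np.exp(m - m_new)              # rescale old state to the new max
            l = l * corr + p.sum(axis=-1)
            acc = acc * corr[:, None] + p @ v[ks:ks + block_kv]
            m = m_new
        attn = acc / l[:, None]
        # Feed-forward applied to the same query block while it is still resident.
        out[qs:qs + block_q] = np.maximum(attn @ w1, 0.0) @ w2
    return out

rng = np.random.default_rng(0)
d, hidden, n = 64, 256, 512
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
w1, w2 = rng.normal(size=(d, hidden)) * 0.1, rng.normal(size=(hidden, d)) * 0.1
print(blockwise_attention_ffn(q, k, v, w1, w2).shape)
```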
Ring Attention
Innovation
- Ring-based peer-to-peer exchange of key/value blocks, overlapped with blockwise attention compute so the context length can scale with the number of devices (simulated below).
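A single-process simulation of the ring schedule: each "device" holds a query shard plus one key/value shard, KV shards rotate around the ring, and each device folds the incoming block into its running online-softmax state. A real implementation issues the rotation asynchronously (e.g. with `jax.lax.ppermute`) so the transfer hides behind the block computation; this sketch captures only the numerics, not the overlap.

```python
import numpy as np

def ring_attention_sim(q_shards, k_shards, v_shards):
    """Simulated ring attention over len(q_shards) 'devices' (no masking)."""
    n_dev = len(q_shards)
    d = q_shards[0].shape[-1]
    scale = 1.0 / np.sqrt(d)
    m = [np.full(q.shape[0], -np.inf) for q in q_shards]   # running max per device
    l = [np.zeros(q.shape[0]) for q in q_shards]           # running denominator
    acc = [np.zeros_like(q) for q in q_shards]             # running numerator
    kv = list(zip(k_shards, v_shards))                     # KV block currently held
    for _ in range(n_dev):
        for i in range(n_dev):                             # each device processes its held block
            k_blk, v_blk = kv[i]
            s = q_shards[i] @ k_blk.T * scale
            m_new = np.maximum(m[i], s.max(axis=-1))
            p = np.exp(s - m_new[:, None])
            corr = np.exp(m[i] - m_new)
            l[i] = l[i] * corr + p.sum(axis=-1)
            acc[i] = acc[i] * corr[:, None] + p @ v_blk
            m[i] = m_new
        kv = kv[-1:] + kv[:-1]                             # rotate KV blocks one step around the ring
    return [a / ll[:, None] for a, ll in zip(acc, l)]

rng = np.random.default_rng(0)
q_shards = [rng.normal(size=(128, 64)) for _ in range(4)]
k_shards = [rng.normal(size=(128, 64)) for _ in range(4)]
v_shards = [rng.normal(size=(128, 64)) for _ in range(4)]
out = ring_attention_sim(q_shards, k_shards, v_shards)
print(len(out), out[0].shape)
```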