2024
- Model
- Mistral 7B
- GQA (Grouped Query Attention)
- SWA (Sliding Window Attention)
- Rolling Buffer Cache (KV cache); see the sketch after these notes.
- Pre-fill and chunking: the KV cache is pre-filled with the prompt, processed in chunks when the prompt is long.
- Instruction Fine-tuning
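A minimal sketch of a rolling buffer KV cache, assuming a sliding window of size W (names and shapes here are illustrative, not Mistral's implementation): position i writes into slot i % W, so the cache never grows beyond W entries.

```python
import numpy as np

# Rolling buffer KV cache sketch (illustrative names/shapes, not Mistral's code).
# With sliding window size W, only the last W key/value pairs are kept:
# position i overwrites slot i % W, so memory stays constant with sequence length.

W, d = 4096, 128                     # window size and head dimension (assumed)
k_cache = np.zeros((W, d))
v_cache = np.zeros((W, d))

def append_kv(pos, k, v):
    """Write the key/value for absolute position `pos` into its rolling slot."""
    slot = pos % W
    k_cache[slot] = k
    v_cache[slot] = v

def window_kv(pos):
    """Return cached K/V for positions max(0, pos - W + 1)..pos in temporal order."""
    n = min(pos + 1, W)
    slots = [p % W for p in range(pos - n + 1, pos + 1)]
    return k_cache[slots], v_cache[slots]

append_kv(0, np.ones(d), np.ones(d))
ks, vs = window_kv(0)                # shapes (1, d): only one position cached so far
```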
- Mixtral 8x7B
- SMoE (Sparse Mixture of Experts)
- Uses a gating network to choose a sparse set of experts per token (see the top-k gating sketch below).
- Uses expert parallelism to handle routing for multi-GPU inference.
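A minimal sketch of sparse top-k gating, assuming a softmax router and k = 2 (all names and sizes are illustrative, not Mixtral's code): only the top-k experts per token are evaluated, and their outputs are mixed by the gate weights.

```python
import numpy as np

# Sparse top-k gating sketch (illustrative, not Mixtral's code). The router
# scores every expert, but only the top-k run per token, so compute stays
# roughly constant as the number of experts grows.

rng = np.random.default_rng(0)
n_experts, d_model, k = 8, 16, 2

W_gate = rng.standard_normal((d_model, n_experts))
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]  # toy experts

def moe_forward(x):                       # x: (d_model,) single token
    logits = x @ W_gate                   # router score for every expert
    top = np.argsort(logits)[-k:]         # indices of the k highest-scoring experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                  # softmax restricted to the selected experts
    # only the selected experts are evaluated; outputs mixed by gate weights
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

y = moe_forward(rng.standard_normal(d_model))
```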
- SMoE (Sparse Mixture of Experts)
- Sparse Gated MoE
- Hierarchical MoEs: the gating network is itself a MoE, so a primary gate selects a group of experts that has its own gate.
- Performance Challenges:
- Shrinking batch problem
- Mix data parallelism for the standard and gating layers with model parallelism for the experts, so each expert sees a combined batch from all replicas.
- Take advantage of convolutionality: apply the MoE to all timesteps together as one big batch.
- Increase the effective batch size for recurrent MoE.
- Network bandwidth
- Use larger expert hidden layers to keep the ratio of computation to communication high (rough calculation below).
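A rough back-of-the-envelope check of the hidden-layer point (all sizes illustrative): an expert with input/output width d and one hidden layer of width h does about 4*d*h FLOPs per token but only moves about 2*d values over the network, so the compute-to-communication ratio grows linearly with h.

```python
# Compute-to-communication ratio for a single-hidden-layer expert (toy sizes).
d = 1024
for h in (1024, 4096, 16384):
    flops = 4 * d * h        # two matmuls (d -> h and h -> d), multiply + add
    comm = 2 * d             # token sent to the expert plus the result sent back
    print(h, flops / comm)   # ratio = 2 * h: larger hidden layers amortize bandwidth
```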
- Balance expert utilization
- Add an auxiliary balancing loss as a regularization term (sketch below).
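A minimal sketch of such a balancing term, assuming an importance-style loss (coefficient and names are illustrative): sum each expert's gate weights over a batch and penalize the squared coefficient of variation so no expert dominates.

```python
import numpy as np

# Importance-based balancing loss sketch (illustrative coefficient and names).
# `gates` holds per-token gate weights over all experts (zero for experts that
# were not selected); the penalty discourages uneven expert utilization.

def balance_loss(gates, coeff=0.01):         # gates: (batch, n_experts)
    importance = gates.sum(axis=0)           # total gate mass routed to each expert
    cv = importance.std() / (importance.mean() + 1e-9)
    return coeff * cv ** 2                   # squared coefficient of variation

gates = np.abs(np.random.default_rng(0).standard_normal((32, 8)))
print(balance_loss(gates))
```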
- Transformer-XL
- Infra
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
- Minimizes layer-wise quantization error by quantizing weights column by column post-training, using the inverse Hessian to compensate the remaining columns (simplified sketch below).
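A heavily simplified sketch of that per-column idea, assuming a toy round-to-nearest quantizer (real GPTQ adds lazy batch updates, a Cholesky form of the inverse Hessian, and more careful dampening): quantize one column, then spread its error onto the not-yet-quantized columns in proportion to the inverse Hessian.

```python
import numpy as np

# Simplified GPTQ-style per-column quantization (illustrative; the real
# algorithm uses blocking and a Cholesky factorization of the inverse Hessian).

def quantize(x, scale):
    """Toy round-to-nearest quantizer on a uniform grid."""
    return np.round(x / scale) * scale

def gptq_layer(W, X, scale=0.05, damp=0.01):
    # W: (out_features, in_features); X: (n_samples, in_features) calibration inputs
    H = 2 * X.T @ X
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])   # dampening for stability
    Hinv = np.linalg.inv(H)
    Wq = W.copy()
    for j in range(W.shape[1]):
        w = Wq[:, j]
        q = quantize(w, scale)
        err = (w - q) / Hinv[j, j]                # column error scaled by diag(H^-1)
        Wq[:, j] = q
        Wq[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])  # compensate later columns
    return Wq

rng = np.random.default_rng(0)
Wq = gptq_layer(rng.standard_normal((8, 16)), rng.standard_normal((64, 16)))
```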
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
- Scales up the weights that matter most for the activations before quantization, sacrificing precision on the other weights (sketch below).
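A minimal sketch of the activation-aware scaling idea, assuming a fixed exponent alpha and per-tensor quantization (real AWQ searches the scale per layer and folds it into the preceding op): channels with large average activation are scaled up before quantization, and the inverse scale is applied to the input, shrinking their effective quantization error.

```python
import numpy as np

# Activation-aware weight scaling sketch (illustrative; real AWQ searches the
# scaling exponent per layer). Salient input channels, judged by average
# activation magnitude, are scaled up before quantization; the inverse scale
# is folded into the activations so the layer output is preserved.

def quantize(x, n_bits=4):
    scale = np.abs(x).max() / (2 ** (n_bits - 1) - 1)
    return np.round(x / scale) * scale

def awq_scale_and_quantize(W, X, alpha=0.5):
    # W: (out_features, in_features); X: (n_samples, in_features) calibration inputs
    act_mag = np.abs(X).mean(axis=0)          # per-input-channel activation magnitude
    s = act_mag ** alpha
    s /= np.sqrt(s.max() * s.min()) + 1e-9    # center the scales around 1 (heuristic)
    Wq = quantize(W * s)                      # error on scaled-up channels is divided back by s at inference
    return Wq, s

def forward(Wq, s, x):
    return Wq @ (x / s)                       # inverse scale applied to the input

rng = np.random.default_rng(0)
Wq, s = awq_scale_and_quantize(rng.standard_normal((8, 16)), rng.standard_normal((64, 16)))
y = forward(Wq, s, rng.standard_normal(16))
```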
- Courses
2023
- Model
- Infra
- Courses
2022
- Model
2021
2020
- Model