Meta Research SuperCluster (RSC)
Fast & flat network
Fast Networking
- 8 InfiniBand links per server (2x more)
- 200Gbps links (2x faster)
- No oversubscription (2.3x faster)
Scale
- Largest known flat InfiniBand fabric
- ~48K links
- ~2k switches
Flat
- Scheduler sees homogeneity
- Workloads free from topology-awareness
- No fragmentation
- 4096-GPU NCCL AllReduce runs at >157GB/s
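For context, bus bandwidth like this is typically measured along the following lines. A minimal sketch using torch.distributed; the buffer size and iteration counts are illustrative assumptions, and the bus-bandwidth formula is the one nccl-tests reports:

```python
# Minimal sketch of timing an NCCL AllReduce with torch.distributed.
# Launch with torchrun, e.g.: torchrun --nproc_per_node=8 allreduce_bench.py
# Buffer size and iteration counts are illustrative assumptions.
import os
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

x = torch.randn(256 * 1024 * 1024, device="cuda")  # 1GiB of fp32

for _ in range(5):                 # warm-up: let NCCL build rings/trees
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
t0 = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
elapsed = (time.perf_counter() - t0) / iters

n = dist.get_world_size()
size_bytes = x.numel() * x.element_size()
# Ring AllReduce moves 2*(n-1)/n of the buffer per rank ("bus bandwidth").
bus_bw = 2 * (n - 1) / n * size_bytes / elapsed / 1e9
if dist.get_rank() == 0:
    print(f"AllReduce bus bandwidth: {bus_bw:.1f} GB/s")
dist.destroy_process_group()
```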
Concepts and References
NVLink:
- Device-to-device (d2d) interconnect communication protocol; 1-to-1 connections.
- Static routing.
- Compare with PCIe (host-to-device, h2d), InfiniBand (host-to-host, h2h), Ethernet (h2h), etc.
NVSwitch:
- Interconnect switching hardware; N-to-1 connectivity.
- Dynamic routing.
- Since the 3rd generation, it can run primitive reduction operations on the switch (SHARP engine).
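NCCL can offload reductions to these engines. A hedged sketch of opting in: NCCL_ALGO=NVLS targets the NVSwitch SHARP engine in newer NCCL releases, and NCCL_COLLNET_ENABLE=1 enables InfiniBand SHARP via the CollNet plugin; whether the offload actually engages depends on the fabric, NCCL version, and installed plugins:

```python
# Hedged sketch: asking NCCL to use in-switch/in-network reductions.
# Availability depends on hardware, NCCL version, and plugins.
import os
os.environ["NCCL_ALGO"] = "NVLS"          # NVSwitch (NVLink SHARP) reductions
os.environ["NCCL_COLLNET_ENABLE"] = "1"   # InfiniBand SHARP via CollNet plugin

import torch.distributed as dist
dist.init_process_group(backend="nccl")   # env vars must be set before init
```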
Software Abstractions Provided by NVLink:
- Each server has its own unified memory space.
- All SMs across the GPUs in a server are pooled and scheduled together.
- NCCL.
Faster GPUs
- 2000 NVIDIA DGX A100 systems (each with 8x A100s)
- 640GB GPU memory per system
- 200Gb/s Ethernet throughput
Data: AIRStore (AI Research Store)
- 6TB cached for datasets totaling up to 40PB.
- 1Gb/s per GPU.
Storage: NFS
- Mainly used to store checkpoints.
- The key is being able to resume on any node: keep jobs stateless, with storage available across the cluster (see the sketch after this list).
- Encryption and privacy; TTL on stored data.
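A minimal sketch of the stateless resume pattern, assuming a hypothetical shared-NFS checkpoint path; the atomic-rename trick is a common idiom, not something the talk specifies:

```python
# Minimal sketch of stateless checkpoint/resume on shared NFS.
# CKPT is a hypothetical path; atomic rename avoids readers ever
# seeing a partially written file.
import os
import torch

CKPT = "/mnt/nfs/ckpt/job_1234/latest.pt"  # hypothetical shared-NFS path

def save_checkpoint(model, optimizer, step):
    tmp = CKPT + ".tmp"
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, tmp)
    os.replace(tmp, CKPT)  # atomic rename: no partial checkpoints

def load_checkpoint(model, optimizer):
    if not os.path.exists(CKPT):
        return 0  # fresh start
    state = torch.load(CKPT, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]  # any node can resume from here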
Orchestration
- Researcher: Cluster -> Job Queue -> Slurm -> Home/NFS/AIRStore.
- Cluster is homogeneous.
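From the researcher's side, this queue-to-Slurm flow can be driven from Python; below is a hedged sketch using submitit, Meta's open-source Slurm launcher (not named in the talk; the partition name, paths, and resource numbers are made up):

```python
# Hedged sketch: submitting a job via submitit (facebookincubator/submitit).
# Partition, folder path, and sizes are illustrative assumptions.
import submitit

def train():
    ...  # training loop: data from AIRStore, checkpoints to NFS

executor = submitit.AutoExecutor(folder="/mnt/nfs/logs/%j")  # hypothetical path
executor.update_parameters(
    slurm_partition="learn",  # assumed partition name
    nodes=4,
    gpus_per_node=8,
    timeout_min=24 * 60,
)
job = executor.submit(train)
print(job.job_id)
```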
Lessons Learned
- Build large-scale systems in a phased approach.
- Failure rates were higher than anticipated.
- Scheduling and prioritization.
- GPU failures.
Next Generation Data Center Design
About physical cooling and machines; a mechanical engineering (ME) problem.
PyTorch 2.0
Graph mode. Ease-of-use UX.
torch.compile(model)
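The one-line entry point in practice; the model and input here are illustrative, and TorchInductor is the default backend:

```python
# Basic PyTorch 2.0 usage; the model and input are illustrative.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))
compiled = torch.compile(model)  # TorchInductor is the default backend

x = torch.randn(32, 1024)
out = compiled(x)  # first call triggers capture and compilation
```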
TorchDynamo
- Resolves problems in graph capture.
- Make the model capturable vs. make sure the captured graph is correct.
Key Features
- Partial graph capture
- Ability to skip unwanted parts of eager.
- Guarded graphs
- Ability to check if a captured graph is valid for execution.
- Just-in-time recapture in eager
- Recapture a graph if the captured graph is invalid for execution (see the sketch after this list).
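A small sketch of these features together: data-dependent control flow forces a partial capture protected by a guard, and a later input that violates the guard triggers just-in-time recapture (shapes and values are arbitrary):

```python
# Illustrative sketch: data-dependent control flow causes a graph break,
# the captured graph is protected by guards, and a guard miss triggers
# just-in-time recapture in eager.
import torch

@torch.compile
def f(x):
    if x.sum() > 0:        # depends on tensor data -> graph break + guard
        return torch.sin(x)
    return torch.cos(x)

f(torch.ones(8))    # captures and compiles the sin branch
f(-torch.ones(8))   # guard fails -> Dynamo recaptures the cos branch
```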
Implementation
Transforms the PyCodeObject via the PEP 523 frame-evaluation API.
TorchInductor
- A PyTorch-native compiler.
- Python-first. Breadth-first. PyTorch-native.
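To see what Inductor actually generates, PyTorch's documented TORCH_COMPILE_DEBUG flag dumps the captured FX graphs and generated kernels; a sketch, noting the flag must be set before torch is imported:

```python
# Hedged sketch: dumping Inductor's intermediate artifacts with the
# documented TORCH_COMPILE_DEBUG flag (set before torch is imported).
import os
os.environ["TORCH_COMPILE_DEBUG"] = "1"

import torch

@torch.compile(backend="inductor")
def f(x):
    return torch.relu(x) * 2.0

f(torch.randn(1024))
# FX graphs and generated Triton/C++ kernels land under
# ./torch_compile_debug/ in the working directory.
```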
Future Works
- Distributed Compiler.
- Export.
Meta’s First Generation of AI Accelerators (MTIA)
Inference Focused. RISC-V based.
Launching a successful accelerator program
- End results
- Improve user experience in Meta apps through modeling
- Ecosystem
- PyTorch for models. Triton for kernels (see the kernel sketch after this list). MLIR for the compiler.
- Design Process
- Chip/system design with open-source components and vendor partners (RISC-V, LLVM compiler).
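As a flavor of the Triton layer of that ecosystem, here is a minimal tutorial-style vector-add kernel; this is not an MTIA-specific kernel, and the block size is an arbitrary choice:

```python
# Tutorial-style Triton vector add, showing the kernel-authoring layer.
# BLOCK_SIZE=1024 is an arbitrary choice; this targets GPUs, not MTIA.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements          # guard the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)       # one program per block of 1024
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```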
Architecture Design
Low power design. 25W.
- 8x8 grid of processing elements (PEs).
- 128KB of local memory per PE.
- 128MB of on-chip SRAM.
- 16 channels of LPDDR5, up to 64GB of off-chip DRAM capacity.
System Design
Up to 12 accelerator boards per host, connected via a PCIe switch.
Software Stack
Application
-> PyTorch
-> AFG (FX Compiled Subgraph Executor) / MTIA Tensor, Memory Allocator, Stream Interface / Eager MTIA PyTorch Operators
-> MTIA Streaming API / Firmware Driver
-> MTIA Firmware
- PyTorch for a familiar developer experience.
- FX IR for model level optimization.
- LLVM IR for lower-level optimizations.
- Runtime and firmware provide streams similar to CUDA (see the analogy sketch after this list).
- Varying levels of abstraction for writing kernels.
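Since the MTIA runtime's stream interface isn't public API here, the CUDA stream API it is said to resemble makes a reasonable analogy; a sketch using CUDA, not MTIA, calls:

```python
# Analogy sketch using CUDA streams (torch.cuda API); the MTIA runtime is
# said to expose a similar interface, but its calls are not shown here.
import torch

s = torch.cuda.Stream()
x = torch.randn(1 << 20, device="cuda")

with torch.cuda.stream(s):       # enqueue work on a side stream
    y = x * 2

torch.cuda.current_stream().wait_stream(s)  # order default stream after s
torch.cuda.synchronize()
```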
Model Characteristics
Industry Trend
Compute scales much faster than memory and interconnect.
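A back-of-the-envelope illustration of the consequence: the arithmetic intensity needed to stay compute-bound rises as FLOPs outgrow bandwidth. The hardware numbers below are illustrative A100-class placeholders, not figures from the talk:

```python
# Back-of-the-envelope ridge-point calculation. Peak-FLOPs and bandwidth
# numbers are illustrative A100-class placeholders, not talk figures.
peak_flops = 312e12   # FLOP/s (e.g. BF16 tensor cores)
mem_bw = 2.0e12       # bytes/s (e.g. ~2 TB/s HBM)

ridge = peak_flops / mem_bw
print(f"ridge point: {ridge:.0f} FLOP/byte")
# Kernels below this arithmetic intensity are memory-bound. As compute
# scales faster than memory/interconnect, the ridge moves right and more
# of the workload becomes bandwidth-limited.
```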