Meta AI Infra Talk

Meta Research SuperCluster (RSC)

Fast & flat network

Fast Networking

  • 8 InfiniBand links per server (2x more)
  • 200Gbps links (2x faster)
  • No oversubscription (2.3x faster)

Scale

  • Largest known flat InfiniBand fabric
    • ~48K links
    • ~2k switches

Flat

  • Scheduler sees homogeneity

    • Workloads are free from topology awareness.
    • No fragmentation.
  • A 4096-GPU NCCL AllReduce runs at >157 GB/s (see the sketch below).
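
A rough sketch of how an AllReduce bandwidth figure like that can be measured with PyTorch's NCCL backend; the payload size, iteration counts, and the ring bus-bandwidth formula are illustrative assumptions, not Meta's benchmark.

```python
# all_reduce_bw.py -- illustrative AllReduce bandwidth probe (not Meta's benchmark).
# Launch with: torchrun --nproc_per_node=8 all_reduce_bw.py
import os
import time

import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")              # NCCL over NVLink / InfiniBand
    rank, world = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # 1 GiB fp32 payload (assumed size).
    x = torch.ones(256 * 1024 * 1024, dtype=torch.float32, device="cuda")

    for _ in range(5):                                    # warm-up
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 20
    t0 = time.time()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = (time.time() - t0) / iters

    # Ring AllReduce bus bandwidth: 2 * (n - 1) / n * bytes / time.
    bus_bw_gbs = 2 * (world - 1) / world * x.numel() * 4 / elapsed / 1e9
    if rank == 0:
        print(f"{world} ranks: ~{bus_bw_gbs:.1f} GB/s bus bandwidth")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```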

Concepts and References

  • NVIDIA’s tech blog on NVSwitch

  • NVLink:

    • Interconnect communication protocol (device-to-device). Point-to-point (1-to-1) connections.
    • Static routing.
    • Compare with PCIe (host-to-device), InfiniBand (host-to-host), Ethernet (host-to-host), etc.
  • NVSwitch:

    • Interconnect switching hardware. N-to-1 connections through the switch.
    • Dynamic routing.
    • Since the 3rd generation, it can run primitive reduction operations on the switch (SHARP engine).
  • Software Abstractions Provided by NVLink:

    • Each server has its own memory space.
    • All SMs inside the GPUs of the server are combined and scheduled together.
    • NCCL.
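
As a small illustration of the intra-server abstraction, PyTorch can report whether two GPUs in the same host have direct peer-to-peer access (over NVLink/NVSwitch when present, otherwise PCIe); this probe is only a sketch and does not identify the link type.

```python
import torch

# List GPU pairs in this server that support direct peer-to-peer copies.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j and torch.cuda.can_device_access_peer(i, j):
            print(f"GPU {i} -> GPU {j}: P2P access available")
```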

Faster GPUs

  • 2000 NVIDIA DGX A100 systems (each with 8x A100 GPUs); rough totals sketched below.
  • 640 GB of GPU memory per system.
  • 200 Gb/s Ethernet throughput.
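
Rough totals implied by those figures (simple arithmetic; the 80 GB-per-A100 breakdown is an assumption):

```python
# Back-of-the-envelope totals for the RSC GPU fleet described above.
systems = 2000
gpus_per_system = 8                      # DGX A100
gpu_mem_per_system_gb = 640              # 8 x 80 GB A100 (assumed breakdown)

total_gpus = systems * gpus_per_system                   # 16,000 GPUs
total_hbm_pb = systems * gpu_mem_per_system_gb / 1e6     # ~1.28 PB of HBM
print(total_gpus, f"{total_hbm_pb:.2f} PB")
```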

Data: AIRStore (AI Research Store)

  • 6 TB cached for datasets totaling up to 40 PB.
  • 1 Gb per GPU.

Storage: NFS

  • Mainly used to store checkpoints.
  • The key is being able to resume on any node: keep jobs stateless, with state available across the cluster (see the sketch below).
  • Encryption and privacy. TTLs on stored data.
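
A minimal sketch of the stateless-resume idea: write checkpoints to a cluster-wide NFS path so any node can pick the job back up. The path and dictionary keys here are hypothetical.

```python
import os
import torch

CKPT = "/checkpoint/my_job/latest.pt"   # hypothetical NFS path visible from every node

def save_checkpoint(model, optimizer, step):
    # Write to a temp file and rename so a half-written checkpoint is never loaded.
    tmp = CKPT + ".tmp"
    torch.save({"model": model.state_dict(),
                "optim": optimizer.state_dict(),
                "step": step}, tmp)
    os.replace(tmp, CKPT)

def maybe_resume(model, optimizer):
    # Any node can resume: job state lives on shared NFS, not on the host.
    if not os.path.exists(CKPT):
        return 0
    state = torch.load(CKPT, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optim"])
    return state["step"]
```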

Orchestration

  • Researcher workflow: Cluster -> Job Queue -> Slurm -> Home/NFS/AIRStore (a submission sketch follows this list).
  • Cluster is homogeneous.
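
A sketch of what the researcher-to-Slurm path can look like from Python. The submitit launcher, partition name, and resource settings are assumptions for illustration; the talk only described the queue/Slurm flow.

```python
import submitit

def train():
    # Training entry point: read datasets from AIRStore, checkpoint to NFS.
    print("training...")

# Submit to the cluster's Slurm queue (partition and resources are hypothetical).
executor = submitit.AutoExecutor(folder="slurm_logs")
executor.update_parameters(
    nodes=2,
    gpus_per_node=8,
    tasks_per_node=8,
    timeout_min=24 * 60,
    slurm_partition="learn",
)
job = executor.submit(train)
print(job.job_id)
```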

Lessons Learned

  • Build large-scale systems in a phased approach.
  • Failure rates were higher than anticipated.
  • Scheduling and prioritization.
  • GPU failures.

Next Generation Data Center Design

About physical cooling and machines: a mechanical engineering (ME) problem.

PyTorch 2.0

Graph mode. Ease-of-use UX.

  • torch.compile(model)
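
A minimal usage sketch (the toy model and input are placeholders):

```python
import torch
import torch.nn as nn

# Any eager-mode model works unchanged; torch.compile wraps it with Dynamo + Inductor.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).cuda()
compiled = torch.compile(model)

x = torch.randn(32, 1024, device="cuda")
y = compiled(x)            # first call triggers graph capture and compilation
```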

TorchDynamo

  • Resolves problems in graph capture.
    • Making the model capturable vs. making sure the captured graph is correct.

Key Features

  • Partial graph capture.
    • Ability to skip unsupported parts of the model and run them in eager mode.
  • Guarded graphs.
    • Ability to check whether a captured graph is valid for execution.
  • Just-in-time recapture in eager.
    • Recapture the graph if the captured graph is invalid for execution (see the sketch below).
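
A small sketch of how partial capture and guards surface in practice; the function and shapes are illustrative.

```python
import torch

def f(x):
    x = x * 2
    if x.sum() > 0:        # data-dependent branch: Dynamo breaks the graph here
        return x.relu()
    return x.sin()

compiled = torch.compile(f)
out = compiled(torch.randn(8))     # runs correctly; the break falls back to eager

# Guards: if an input property the captured graph assumed changes (e.g. shape),
# the guard fails and the graph may be recaptured/recompiled.
out2 = compiled(torch.randn(16))
```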

Implementation

Transformed PyCodeObject. PEP 523 API.

TorchInductor

  • A PyTorch-native compiler.
  • Python-first, breadth-first, PyTorch-native.

Future Works

  • Distributed Compiler.
  • Export.

Meta’s First Generation of AI Accelerators (MTIA)

Inference Focused. RISC-V based.

Launching a successful accelerator program

  • End results
    • Improve user experience in Meta apps through modeling
  • Ecosystem
    • PyTorch for models, Triton for kernels, MLIR for the compiler.
  • Design Process
    • Chip/System design with open source components and vendor partners (RISC-V, LLVM compiler).

Architecture Design

Low power design. 25W.

  • 8x8 grid of processing elements (PEs).
    • 128 KB of local memory per PE.
  • 128 MB of on-chip SRAM.
  • 16 channels of LPDDR5 memory, up to 64 GB of off-chip DRAM capacity (rough totals sketched below).
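
Simple arithmetic on the memory hierarchy those figures imply (not a spec sheet):

```python
# Memory hierarchy implied by the MTIA v1 numbers above.
pes = 8 * 8                                        # 64 processing elements
local_kb_per_pe = 128
total_pe_local_mb = pes * local_kb_per_pe / 1024   # 8 MB of PE-local memory in total
on_chip_sram_mb = 128                              # shared on-chip SRAM
off_chip_dram_gb = 64                              # LPDDR5 ceiling across 16 channels
print(total_pe_local_mb, on_chip_sram_mb, off_chip_dram_gb)
```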

System Design

Up to 12 accelerator boards per host, connected through a PCIe switch.

Software Stack

Application -> PyTorch -> AFG (FX Compiled Subgraph Executor) / MTIA tensor, memory allocator, and stream interfaces / eager MTIA PyTorch operators -> MTIA streaming API / firmware driver -> MTIA firmware

  • PyTorch for a familiar developer experience.
  • FX IR for model-level optimization (a capture sketch follows this list).
  • LLVM IR for lower-level optimizations.
  • Runtime and firmware provide streams similar to CUDA.
  • Varying levels of abstraction for writing kernels.
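
A minimal sketch of capturing FX IR, the model-level representation the stack above operates on. This shows generic torch.fx on a toy module, not Meta's MTIA-specific pipeline.

```python
import torch
import torch.nn as nn
from torch.fx import symbolic_trace

class Toy(nn.Module):
    def forward(self, x):
        return torch.relu(x @ x.t()) + 1

# Capture the module as an FX graph; a backend compiler can rewrite this IR
# (operator fusion, quantization, device-specific lowering) before codegen.
gm = symbolic_trace(Toy())
print(gm.graph)   # placeholder -> matmul -> relu -> add -> output
print(gm.code)    # generated Python that executes the captured graph
```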

Model characteristics

Industry Trend

Compute scales much faster than memory and interconnect.
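
A hedged way to quantify that trend is the roofline break-even point: the arithmetic intensity (FLOPs per byte) a workload needs before compute, not memory bandwidth, becomes the bottleneck. The peak numbers below are illustrative, not any specific part.

```python
# Break-even arithmetic intensity = peak compute / memory bandwidth (FLOPs per byte).
def breakeven_intensity(peak_tflops, mem_bw_tb_s):
    return peak_tflops / mem_bw_tb_s

print(breakeven_intensity(312, 2.0))     # ~156 FLOP/B for a 312 TFLOP/s, 2 TB/s part
print(breakeven_intensity(1000, 3.0))    # compute up ~3.2x, bandwidth up 1.5x -> ~333 FLOP/B
```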
