Meta AI Infra Talk

Meta Research SuperCluster (RSC)

Fast & flat network

Fast Networking

  • 8 InfiniBand links per server (2x more)
  • 200Gbps links (2x faster)
  • No oversubscription (2.3x faster)

Scale

  • Largest known flat InfiniBand fabric
    • ~48K links
    • ~2k switches

Flat

  • Scheduler sees homogeneity

    • Workloads are free from topology awareness.
    • No fragmentation.
  • A 4096-GPU NCCL AllReduce runs at >157 GB/s (see the sketch below).
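
A rough sketch of how an AllReduce bandwidth figure like that can be measured with PyTorch's NCCL backend; the payload size, iteration counts, and the ring bus-bandwidth formula are illustrative assumptions, not Meta's benchmark.

```python
# all_reduce_bw.py -- illustrative AllReduce bandwidth probe (not Meta's benchmark).
# Launch with: torchrun --nproc_per_node=8 all_reduce_bw.py
import os
import time

import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")              # NCCL over NVLink / InfiniBand
    rank, world = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # 1 GiB fp32 payload (assumed size).
    x = torch.ones(256 * 1024 * 1024, dtype=torch.float32, device="cuda")

    for _ in range(5):                                    # warm-up
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 20
    t0 = time.time()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = (time.time() - t0) / iters

    # Ring AllReduce bus bandwidth: 2 * (n - 1) / n * bytes / time.
    bus_bw_gbs = 2 * (world - 1) / world * x.numel() * 4 / elapsed / 1e9
    if rank == 0:
        print(f"{world} ranks: ~{bus_bw_gbs:.1f} GB/s bus bandwidth")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```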

Concepts and References

  • NVIDIA’s tech blog on NVSwitch

  • NVLink:

    • Interconnect communication protocol (device-to-device). Point-to-point (1-to-1) connections.
    • Static routing.
    • Compare with PCIe (host-to-device), InfiniBand (host-to-host), Ethernet (host-to-host), etc.
  • NVSwitch:

    • Interconnect switching hardware. N-to-1 connections through the switch.
    • Dynamic routing.
    • Since the 3rd generation, it can run primitive reduction operations on the switch (SHARP engine).
  • Software Abstractions Provided by NVLink:

    • Each server has its own memory space.
    • All SMs inside the GPUs of the server are combined and scheduled together.
    • NCCL.
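
As a small illustration of the intra-server abstraction, PyTorch can report whether two GPUs in the same host have direct peer-to-peer access (over NVLink/NVSwitch when present, otherwise PCIe); this probe is only a sketch and does not identify the link type.

```python
import torch

# List GPU pairs in this server that support direct peer-to-peer copies.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j and torch.cuda.can_device_access_peer(i, j):
            print(f"GPU {i} -> GPU {j}: P2P access available")
```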

Faster GPUs

  • 2000 NVIDIA DGX A100 systems (each with 8x A100 GPUs); rough totals sketched below.
  • 640 GB of GPU memory per system.
  • 200 Gb/s Ethernet throughput.
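
Rough totals implied by those figures (simple arithmetic; the 80 GB-per-A100 breakdown is an assumption):

```python
# Back-of-the-envelope totals for the RSC GPU fleet described above.
systems = 2000
gpus_per_system = 8                      # DGX A100
gpu_mem_per_system_gb = 640              # 8 x 80 GB A100 (assumed breakdown)

total_gpus = systems * gpus_per_system                   # 16,000 GPUs
total_hbm_pb = systems * gpu_mem_per_system_gb / 1e6     # ~1.28 PB of HBM
print(total_gpus, f"{total_hbm_pb:.2f} PB")
```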

Data: AIRStore (AI Research Store)

  • 6 TB cached for datasets totaling up to 40 PB.
  • 1 Gb per GPU.

Storage: NFS

  • Mainly used to store checkpoints.
  • The key is being able to resume on any node: keep jobs stateless, with state available across the cluster (see the sketch below).
  • Encryption and privacy. TTLs on stored data.
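
A minimal sketch of the stateless-resume idea: write checkpoints to a cluster-wide NFS path so any node can pick the job back up. The path and dictionary keys here are hypothetical.

```python
import os
import torch

CKPT = "/checkpoint/my_job/latest.pt"   # hypothetical NFS path visible from every node

def save_checkpoint(model, optimizer, step):
    # Write to a temp file and rename so a half-written checkpoint is never loaded.
    tmp = CKPT + ".tmp"
    torch.save({"model": model.state_dict(),
                "optim": optimizer.state_dict(),
                "step": step}, tmp)
    os.replace(tmp, CKPT)

def maybe_resume(model, optimizer):
    # Any node can resume: job state lives on shared NFS, not on the host.
    if not os.path.exists(CKPT):
        return 0
    state = torch.load(CKPT, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optim"])
    return state["step"]
```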

Orchestration

  • Researcher workflow: Cluster -> Job Queue -> Slurm -> Home/NFS/AIRStore (a submission sketch follows this list).
  • Cluster is homogeneous.
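
A sketch of what the researcher-to-Slurm path can look like from Python. The submitit launcher, partition name, and resource settings are assumptions for illustration; the talk only described the queue/Slurm flow.

```python
import submitit

def train():
    # Training entry point: read datasets from AIRStore, checkpoint to NFS.
    print("training...")

# Submit to the cluster's Slurm queue (partition and resources are hypothetical).
executor = submitit.AutoExecutor(folder="slurm_logs")
executor.update_parameters(
    nodes=2,
    gpus_per_node=8,
    tasks_per_node=8,
    timeout_min=24 * 60,
    slurm_partition="learn",
)
job = executor.submit(train)
print(job.job_id)
```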

Lessons Learned

  • Build large-scale systems in a phased approach.
  • Failure rates were higher than anticipated.
  • Scheduling and prioritization.
  • GPU failures.

Next Generation Data Center Design

About physical cooling and machines: a mechanical engineering (ME) problem.

PyTorch 2.0

Graph mode. Ease-of-use UX.

  • torch.compile(model)
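
A minimal usage sketch (the toy model and input are placeholders):

```python
import torch
import torch.nn as nn

# Any eager-mode model works unchanged; torch.compile wraps it with Dynamo + Inductor.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).cuda()
compiled = torch.compile(model)

x = torch.randn(32, 1024, device="cuda")
y = compiled(x)            # first call triggers graph capture and compilation
```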

TorchDynamo

  • Resolves problems in graph capture.
    • Making the model capturable vs. making sure the captured graph is correct.

Key Features

  • Partial graph capture.
    • Ability to skip unsupported parts of the model and run them in eager mode.
  • Guarded graphs.
    • Ability to check whether a captured graph is valid for execution.
  • Just-in-time recapture in eager.
    • Recapture the graph if the captured graph is invalid for execution (see the sketch below).
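
A small sketch of how partial capture and guards surface in practice; the function and shapes are illustrative.

```python
import torch

def f(x):
    x = x * 2
    if x.sum() > 0:        # data-dependent branch: Dynamo breaks the graph here
        return x.relu()
    return x.sin()

compiled = torch.compile(f)
out = compiled(torch.randn(8))     # runs correctly; the break falls back to eager

# Guards: if an input property the captured graph assumed changes (e.g. shape),
# the guard fails and the graph may be recaptured/recompiled.
out2 = compiled(torch.randn(16))
```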

Implementation

Transformed PyCodeObject. PEP 523 API.

TorchInductor

  • A PyTorch-native compiler.
  • Python-first, breadth-first, PyTorch-native.

Future Works

  • Distributed Compiler.
  • Export.

Meta’s First Generation of AI Accelerators (MTIA)

Inference Focused. RISC-V based.

Launching a successful accelerator program

  • End results
    • Improve user experience in Meta apps through modeling
  • Ecosystem
    • PyTorch for models, Triton for kernels, MLIR for the compiler.
  • Design Process
    • Chip/System design with open source components and vendor partners (RISC-V, LLVM compiler).

Architecture Design

Low power design. 25W.

  • 8x8 grid of processing elements (PEs).
    • 128 KB of local memory per PE.
  • 128 MB of on-chip SRAM.
  • 16 channels of LPDDR5 memory, up to 64 GB of off-chip DRAM capacity (rough totals sketched below).
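
Simple arithmetic on the memory hierarchy those figures imply (not a spec sheet):

```python
# Memory hierarchy implied by the MTIA v1 numbers above.
pes = 8 * 8                                        # 64 processing elements
local_kb_per_pe = 128
total_pe_local_mb = pes * local_kb_per_pe / 1024   # 8 MB of PE-local memory in total
on_chip_sram_mb = 128                              # shared on-chip SRAM
off_chip_dram_gb = 64                              # LPDDR5 ceiling across 16 channels
print(total_pe_local_mb, on_chip_sram_mb, off_chip_dram_gb)
```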

System Design

Up to 12 accelerator boards per host, connected through a PCIe switch.

Software Stack

Application -> PyTorch -> AFG (FX Compiled Subgraph Executor) / MTIA tensor, memory allocator, and stream interfaces / eager MTIA PyTorch operators -> MTIA streaming API / firmware driver -> MTIA firmware

  • PyTorch for a familiar developer experience.
  • FX IR for model-level optimization (a capture sketch follows this list).
  • LLVM IR for lower-level optimizations.
  • Runtime and firmware provide streams similar to CUDA.
  • Varying levels of abstraction for writing kernels.
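
A minimal sketch of capturing FX IR, the model-level representation the stack above operates on. This shows generic torch.fx on a toy module, not Meta's MTIA-specific pipeline.

```python
import torch
import torch.nn as nn
from torch.fx import symbolic_trace

class Toy(nn.Module):
    def forward(self, x):
        return torch.relu(x @ x.t()) + 1

# Capture the module as an FX graph; a backend compiler can rewrite this IR
# (operator fusion, quantization, device-specific lowering) before codegen.
gm = symbolic_trace(Toy())
print(gm.graph)   # placeholder -> matmul -> relu -> add -> output
print(gm.code)    # generated Python that executes the captured graph
```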

Model characteristics

Industry Trend

Compute scales much faster than memory and interconnect.
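
A hedged way to quantify that trend is the roofline break-even point: the arithmetic intensity (FLOPs per byte) a workload needs before compute, not memory bandwidth, becomes the bottleneck. The peak numbers below are illustrative, not any specific part.

```python
# Break-even arithmetic intensity = peak compute / memory bandwidth (FLOPs per byte).
def breakeven_intensity(peak_tflops, mem_bw_tb_s):
    return peak_tflops / mem_bw_tb_s

print(breakeven_intensity(312, 2.0))     # ~156 FLOP/B for a 312 TFLOP/s, 2 TB/s part
print(breakeven_intensity(1000, 3.0))    # compute up ~3.2x, bandwidth up 1.5x -> ~333 FLOP/B
```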
