
Reading Notes – 8


Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving

  • Problem Definition
    • Prefill: compute bound; cost grows superlinearly with context length; measured by TTFT (see the cost sketch after this list).
    • Decode: memory bound; scales sublinearly with context length; measured by TBT (time between tokens).
    • Baseline disaggregated LLM serving architecture
      • Reuse KVCache to reduce prefill compute -> may increase TTFT, since reused cache often has to be fetched from remote storage
      • Maximize the number of tokens per batch to improve FLOPs utilization -> increases TBT
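
A rough back-of-envelope cost model makes the prefill/decode asymmetry concrete. The sketch below is not from the paper; the constants are standard dense-transformer approximations (assumed d_model and layer count), and only the scaling matters: prefill FLOPs pick up a quadratic attention term, while decode cost per token is dominated by the bytes read for weights plus the growing KVCache.

```python
# Rough cost model for one transformer forward pass (constants are assumptions,
# not measurements from the Mooncake paper). Prefill FLOPs grow superlinearly
# with context length because of the quadratic attention term; decode is
# dominated by bytes read per token (weights + KVCache), i.e. it is memory bound.

def prefill_flops(n_tokens, d_model=4096, n_layers=32):
    linear = 24 * n_layers * n_tokens * d_model ** 2        # projections + MLP
    attention = 4 * n_layers * n_tokens ** 2 * d_model      # quadratic in context
    return linear + attention

def decode_bytes_per_token(context_len, d_model=4096, n_layers=32, bytes_per=2):
    weights = 12 * n_layers * d_model ** 2 * bytes_per          # model weights
    kvcache = 2 * n_layers * context_len * d_model * bytes_per  # grows with context
    return weights + kvcache

for n in (1024, 4096, 16384):
    print(f"{n:6d} tokens: prefill ~{prefill_flops(n):.2e} FLOPs, "
          f"decode ~{decode_bytes_per_token(n):.2e} bytes/token")
```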
  • Innovation
    • KVCache-centric disaggregated architecture
      • Transfer the reusable KVCache to the selected prefill instance
      • Complete prefill in chunks/layers and stream the output KVCache to the decode instance
      • Load the KVCache and add the request to the continuous batching process
    • KVCache-centric conductor
      • Predicts future usage of KVCache blocks.
      • Swaps and replicates KVCache blocks accordingly.
    • Manages resource scaling and request rejection under overload.
  • Mooncake’s Disaggregated Architecture
    • Hardware
      • Separates prefill and decode hosts.
      • Pools CPU, DRAM, SSD, and RDMA resources to implement a disaggregated KVCache.
    • KVCache Reuse
      • Prefill hosts receive the raw input, the block IDs of the prefix cache, and the block IDs of the full cache; instance selection balances cache reuse against load.
      • Transfer KVCache from remote CPUs to local GPUs.
    • Incremental Prefill
      • Completes prefill using the prefix cache and stores the newly generated (incremental) KVCache back into CPU memory. If the number of uncached input tokens exceeds a threshold, the prefill is split into chunks and executed in a pipelined manner.
    • KVCache Transfer
      • Messenger manages KVCache transfer (overlapped with incremental prefill), streaming the KVCache generated on each node to reduce waiting time.
    • Decoding
      • Decode hosts run continuous batching. The SLO is double-checked at this stage and the request is rejected if it would be violated (end-to-end flow sketched below).
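
To tie the pieces above together, here is a minimal end-to-end sketch of the disaggregated flow. All class and method names (PrefillInstance, DecodeInstance, Messenger, send_layer) are hypothetical stand-ins, not Mooncake's actual API; the point is only the shape of the pipeline: prefill layer by layer, stream each layer's KVCache as soon as it is produced, then hand the request to the decode side's continuous batch.

```python
# Hypothetical sketch of a Mooncake-style request flow; names are illustrative,
# not the paper's real interfaces. NUM_LAYERS and the "KV" payloads are placeholders.

NUM_LAYERS = 4

class DecodeInstance:
    def __init__(self):
        self.batch = []                 # requests in the continuous-batching loop
    def add(self, req):
        self.batch.append(req)

class Messenger:
    """Streams per-layer KVCache to the decode side, overlapped with prefill."""
    def __init__(self, decode):
        self.decode = decode
        self.layers_sent = {}
    def send_layer(self, req_id, req, layer):
        sent = self.layers_sent.setdefault(req_id, set())
        sent.add(layer)
        if len(sent) == NUM_LAYERS:     # all layers arrived -> join decode batch
            self.decode.add(req)

class PrefillInstance:
    def run(self, req_id, req, messenger):
        # Reusable prefix KVCache would be loaded from the CPU/DRAM/SSD pool here;
        # only the uncached suffix is prefilled, layer by layer.
        for layer in range(NUM_LAYERS):
            req["kv"][layer] = f"KV(layer={layer}, tokens={len(req['tokens'])})"
            messenger.send_layer(req_id, req, layer)   # stream as soon as ready

decode = DecodeInstance()
prefill = PrefillInstance()
req = {"tokens": list(range(16)), "prefix_blocks": [0, 1], "kv": {}}
prefill.run("req-0", req, Messenger(decode))
print("requests now in decode batch:", len(decode.batch))
```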
  • Prefill Pool
    • Multi-Node Prefill
      • CPP (Chunked Pipeline Parallelism)
        • Requires cross-node communication only at the boundaries of pipeline stages.
        • Fits both short and long contexts without extra overhead for short ones.
    • Layer-wise Prefill
      • Loads/stores KVCache layer by layer so transfer overlaps with computation (see the sketch after this list).
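
A small sketch of the layer-wise idea, under the assumption that the KVCache store/transfer can be issued asynchronously: the store for layer i runs on a background worker while layer i+1 is being computed, so transfer time hides behind compute. The compute_layer/store_kvcache functions are placeholders, not Mooncake kernels.

```python
# Layer-wise prefill sketch: overlap each layer's KVCache store with the next
# layer's computation. The worker pool stands in for an async copy engine.

from concurrent.futures import ThreadPoolExecutor

def compute_layer(layer, hidden):
    return hidden + 1                              # placeholder transformer layer

def store_kvcache(layer, hidden):
    return f"stored KV for layer {layer}"          # placeholder copy to CPU/DRAM

def layerwise_prefill(num_layers, hidden=0):
    pending = []
    with ThreadPoolExecutor(max_workers=1) as io_pool:
        for layer in range(num_layers):
            hidden = compute_layer(layer, hidden)
            # Kick off the KVCache store without blocking the next layer's compute.
            pending.append(io_pool.submit(store_kvcache, layer, hidden))
        for f in pending:
            f.result()            # drain outstanding transfers before finishing
    return hidden

print(layerwise_prefill(4))
```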
  • KVCache-centric Scheduling
    • Prefill Global Scheduling
      • Model based: estimates TTFT per candidate instance.
      • Considers instance load, prefix cache hit length, and the distribution of reusable KVCache blocks.
    • Cache load balancing
      • Heuristic based
      • The conductor forwards the cache's location together with the request to an alternative instance when the additional prefill time there is shorter than the cache transfer time.
      • The request recomputes its input tokens locally if the best remote prefix match length is no larger than the current local reusable prefix length multiplied by a threshold (heuristic sketched below).
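
A hedged sketch of what such a cache-aware scheduler could look like. The per-token/per-block costs and the replication threshold are made-up constants; the structure follows the notes above: estimate TTFT per prefill instance from queueing time, cache transfer time, and prefill time for the uncached suffix, pick the minimum, and replicate a hot remote cache when its prefix match is much longer than the chosen instance's local match.

```python
# Sketch of KVCache-centric prefill scheduling. All cost constants and field
# names are illustrative assumptions, not values from the paper.

def estimate_ttft(inst, input_len, transfer_per_block=1e-4, prefill_per_token=2e-4):
    """TTFT ~ queueing + pulling missing KVCache blocks + prefilling uncached tokens."""
    uncached = input_len - inst["prefix_hit_len"]
    transfer = inst["remote_blocks_needed"] * transfer_per_block
    return inst["queue_time"] + transfer + uncached * prefill_per_token

def schedule(instances, input_len, replicate_threshold=2.0):
    target = min(instances, key=lambda i: estimate_ttft(i, input_len))
    best_remote = max((i["prefix_hit_len"] for i in instances if i is not target),
                      default=0)
    # Hot-cache heuristic: replicate a remote cache to the target instead of
    # recomputing when its prefix match is much longer than the target's own.
    replicate = best_remote > target["prefix_hit_len"] * replicate_threshold
    return target, replicate

instances = [
    {"name": "p0", "queue_time": 0.05, "prefix_hit_len": 512, "remote_blocks_needed": 0},
    {"name": "p1", "queue_time": 0.01, "prefix_hit_len": 0,   "remote_blocks_needed": 4},
]
target, replicate = schedule(instances, input_len=2048)
print("chosen:", target["name"], "| replicate hot cache:", replicate)
```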
  • Overload-oriented Scheduling
    • Decide whether to accept prefill/decode requests based on instance load.
    • Early rejection based on prefill/decode pool status.
      • Rejecting based on current load causes load fluctuations because of the delay between the rejection decision and its effect on the decode pool.
      • Mooncake instead rejects based on predicted future load (prediction sketched below).
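
A toy sketch of prediction-based admission. The load model is an assumption for illustration only: the decode pool's future load is estimated as its current load plus the requests already in the prefill pipeline minus the requests expected to finish by then, and a new request is rejected up front if that predicted load would exceed capacity.

```python
# Early rejection on predicted (not current) decode load; the model is a
# simple assumption, not the paper's estimator.

def predicted_decode_load(current_load, in_prefill, expected_finishes):
    # Requests already being prefilled will land on the decode pool shortly,
    # while some currently running requests will have finished by then.
    return current_load + in_prefill - expected_finishes

def admit(current_load, in_prefill, expected_finishes, decode_capacity=256):
    future_load = predicted_decode_load(current_load, in_prefill, expected_finishes)
    return future_load < decode_capacity    # reject early if predicted overload

print(admit(current_load=200, in_prefill=80, expected_finishes=40))
```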
  • Related Work
    • FasterTransformer, TensorRT-LLM, DeepSpeed, and vLLM

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

  • Observation
    • Decode is more expensive per token than prefill because it underutilizes the GPU
      • Prefill can saturate the GPU even at batch size 1 with a sequence length of 1024.
      • Decode cannot saturate the GPU: increasing batch size from 1 to 8 keeps per-iteration latency nearly constant, except for the attention component (arithmetic-intensity sketch after this list).
    • Under pipeline parallelism, prefill and decode micro-batches differ in size and compute time, causing pipeline bubbles.
      • Varying number of tokens in consecutive micro-batches.
      • Different compute times for the prefill and decode stages.
      • Accumulated context (KV) length varies across requests.
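
The saturation claim can be checked with a back-of-envelope arithmetic-intensity calculation for a single d x d linear layer (GEMM). The hardware FLOPs-per-byte ratio below is an assumed round number, not a measurement from the paper, but the trend matches the observation: 1 to 8 decode tokens stay far below the compute/bandwidth ratio, while ~1K prefill tokens are comfortably compute-bound.

```python
# FLOPs per byte moved for a (tokens x d) @ (d x d) GEMM; hardware ratio assumed.

def arithmetic_intensity(tokens, d_model=4096, bytes_per=2):
    flops = 2 * tokens * d_model * d_model                                # GEMM FLOPs
    bytes_moved = (d_model * d_model + 2 * tokens * d_model) * bytes_per  # weights + activations
    return flops / bytes_moved

GPU_FLOPS_PER_BYTE = 150   # assumed compute/bandwidth ratio of an A100-class GPU

for tokens in (1, 8, 1024):
    ai = arithmetic_intensity(tokens)
    regime = "compute-bound" if ai > GPU_FLOPS_PER_BYTE else "memory-bound"
    print(f"{tokens:5d} tokens: ~{ai:.0f} FLOPs/byte -> {regime}")
```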
  • Chunked Prefill
    • Split the prompt into chunks and prefill one chunk per iteration with a crafted attention mask (mask sketched below)
      • Slightly lower arithmetic intensity than full prefill.
      • Slight KV-cache overhead because earlier chunks' KV is read again for each new chunk.
        • Acceptable for short sequences (~1K) since attention is a small fraction of total time.
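
The "crafted attention mask" reduces to the ordinary causal rule applied at absolute positions: queries in the current chunk may attend to every key of earlier chunks plus the causal prefix of their own chunk. A minimal NumPy illustration of the mask shape (not the fused kernel) is below.

```python
# Attention mask for one prefill chunk, built from absolute positions.

import numpy as np

def chunk_attention_mask(chunk_start, chunk_size, total_kv_len):
    """mask[i, j] is True when query token (chunk_start + i) may attend to KV position j."""
    q_pos = chunk_start + np.arange(chunk_size)[:, None]   # absolute query positions
    k_pos = np.arange(total_kv_len)[None, :]               # absolute key positions
    return k_pos <= q_pos                                  # ordinary causal rule

# Second chunk (tokens 4-7) of an 8-token prompt split into chunks of 4:
print(chunk_attention_mask(chunk_start=4, chunk_size=4, total_kv_len=8).astype(int))
```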
  • Decode Maximal Batching
    • Piggyback decode tokens onto each prefill chunk (batch construction sketched below)
    • Fuse all linear operations across prefill and decode tokens; compute attention separately
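
A small sketch of how one hybrid iteration could be assembled, with illustrative field names (remaining_tokens, last_token) rather than SARATHI's actual data structures: one prefill chunk plus up to B-1 decode tokens are concatenated so the linear layers run once over a single fused token list, while attention would be handled per request.

```python
# Decode-maximal batching sketch: one prefill chunk + piggybacked decodes.

def build_hybrid_batch(prefill_req, decode_reqs, chunk_tokens, max_batch_size):
    chunk = prefill_req["remaining_tokens"][:chunk_tokens]       # next prefill chunk
    decodes = decode_reqs[:max_batch_size - 1]                   # piggybacked decodes
    fused = chunk + [r["last_token"] for r in decodes]           # single fused GEMM input
    return {"prefill_chunk": chunk, "decodes": decodes, "fused_tokens": fused}

prefill = {"remaining_tokens": list(range(1000))}
decodes = [{"last_token": t} for t in (7, 8, 9)]
batch = build_hybrid_batch(prefill, decodes, chunk_tokens=256, max_batch_size=4)
print(len(batch["prefill_chunk"]), "prefill tokens +", len(batch["decodes"]), "decodes")
```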
  • Identifying Ideal Chunk Size
    • Smaller chunks piggyback more decodes at the expense of prefill efficiency; larger chunks are prefill-efficient but piggyback fewer decodes.
    • Choose a chunk size that fully utilizes GPU tiles and avoids wasted computation on padding (selection sketched below).
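
One way to operationalize that trade-off, as a sketch with assumed numbers (tile width 128, target of 512 tokens per iteration, neither taken from the paper): budget the iteration's tokens, reserve slots for the piggybacked decodes, and round the prefill chunk down to a tile multiple so no compute is spent on padding.

```python
# Chunk-size selection sketch; tile width and token budget are assumptions.

def pick_chunk_size(target_tokens, num_decodes, tile=128):
    chunk = target_tokens - num_decodes       # leave slots for piggybacked decodes
    chunk -= chunk % tile                     # align to the matmul tile, no padding waste
    return max(chunk, tile)

for decodes in (0, 32, 200):
    print(decodes, "decodes ->", pick_chunk_size(512, decodes), "prefill tokens")
```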
