Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
- Problem Definition
- Prefill: compute bound; cost grows superlinearly with context length. Measured by TTFT (time to first token).
- Decode: memory bound; cost grows sublinearly with context length. Measured by TTIT.
- Baseline disaggregated LLM serving architecture
- Reuse KVCache to reduce compute, but fetching remote cache can increase TTFT.
- Maximize the number of tokens per batch to improve FLOPS utilization, but larger batches increase TTIT.
- Innovation
- KVCache-centric disaggregated architecture
- Transfer reusable KVCache to the selected prefill instance.
- Complete prefill in chunks/layers and stream the output KVCache to the decode instance.
- Load the KVCache and add the request to the continuous batching process.
- KVCache-centric conductor
- Predicts future usage of KVCache blocks.
- Swaps and replicates KVCache blocks.
- Manages resource scaling and request rejection (request flow sketched below).
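A minimal sketch of this request flow; the `PrefillInstance`/`DecodeInstance` interfaces are hypothetical stand-ins rather than Mooncake's actual API, and KVCache chunks are represented as plain token lists. The key point is that the decode side starts receiving KVCache before prefill finishes:

```python
from dataclasses import dataclass
from typing import Iterator, List

@dataclass
class Request:
    tokens: List[int]            # raw input token IDs
    prefix_block_ids: List[int]  # KVCache blocks reusable from the pool

class PrefillInstance:
    def run_incremental(self, req: Request, chunk: int = 4) -> Iterator[List[int]]:
        # Prefill only the uncached suffix, chunk by chunk, yielding each
        # chunk's KVCache so it can be streamed to the decode side early.
        uncached = req.tokens[len(req.prefix_block_ids):]
        for i in range(0, len(uncached), chunk):
            yield uncached[i:i + chunk]  # stand-in for a produced KVCache chunk

class DecodeInstance:
    def __init__(self) -> None:
        self.kv: List[int] = []

    def stream_in(self, kv_chunk: List[int]) -> None:
        self.kv.extend(kv_chunk)         # load KVCache as it arrives

    def continuous_batch(self, req: Request) -> str:
        return f"decoding with {len(req.prefix_block_ids) + len(self.kv)} KV entries"

def serve(req: Request) -> str:
    prefill, decode = PrefillInstance(), DecodeInstance()  # conductor-picked pair
    for kv_chunk in prefill.run_incremental(req):          # chunked prefill
        decode.stream_in(kv_chunk)                         # overlapped transfer
    return decode.continuous_batch(req)                    # join decode batching

print(serve(Request(tokens=list(range(10)), prefix_block_ids=[0, 1, 2])))
```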
- Mooncake’s Disaggregated Architecture
- Hardware
- Separates prefill and decode hosts.
- Groups CPU, DRAM, SSD, and RDMA resources to implement the disaggregated KVCache.
- KVCache Reuse
- Prefill hosts receive the raw input, the block IDs of the reusable prefix cache, and the block IDs allocated for the full cache; the conductor balances cache reuse against load.
- Reusable KVCache is transferred from remote CPU memory to local GPU memory.
- Incremental Prefill
- Completes prefill using the prefix cache and stores the newly generated incremental KVCache into CPU memory. If the number of uncached input tokens exceeds a threshold, the prefill is split into chunks and executed in a pipelined manner (chunking decision sketched below).
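A small sketch of the chunking decision; the helper name and threshold value are illustrative assumptions, not from the paper:

```python
# Hypothetical helper: split the uncached suffix into pipelined chunks only
# when it exceeds a threshold (chunk_size here is an illustrative value).
def plan_prefill(num_uncached_tokens: int, chunk_size: int = 8192):
    if num_uncached_tokens <= chunk_size:
        return [(0, num_uncached_tokens)]  # single pass, no pipelining
    return [(start, min(start + chunk_size, num_uncached_tokens))
            for start in range(0, num_uncached_tokens, chunk_size)]

print(plan_prefill(5000))   # [(0, 5000)]
print(plan_prefill(20000))  # [(0, 8192), (8192, 16384), (16384, 20000)]
```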
- KVCache Transfer
- The Messenger component manages KVCache transfer, overlapped with incremental prefill, streaming the KVCache generated by each node to reduce waiting time.
- Decoding
- Decode hosts run continuous batching. The SLO is double-checked, and the request is rejected if it would be violated.
- Prefill Pool
- Multi-Node Prefill
- CPP (Chunked Pipeline Parallelism)
- Requires cross-node communication only at the boundaries of pipeline stages.
- Fits both short and long contexts.
- Layer-wise Prefill
- Prefills the KVCache layer by layer, overlapping transfer with computation (sketched below).
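A sketch of the layer-wise overlap, using a one-worker thread pool as a stand-in for the asynchronous transfer engine; `compute_layer` and `store_kv` are placeholder names:

```python
# Layer-wise prefill sketch: the KVCache of layer i is stored/streamed out
# while layer i+1 is being computed.
from concurrent.futures import ThreadPoolExecutor

def compute_layer(layer: int) -> str:
    return f"kv_layer_{layer}"   # placeholder for the layer's KV output

def store_kv(kv: str) -> None:
    pass                         # placeholder for the CPU/RDMA transfer

def layerwise_prefill(num_layers: int) -> None:
    with ThreadPoolExecutor(max_workers=1) as transfer:
        pending = None
        for layer in range(num_layers):
            kv = compute_layer(layer)                # compute current layer
            if pending is not None:
                pending.result()                     # previous transfer finished?
            pending = transfer.submit(store_kv, kv)  # overlap store w/ next layer
        if pending is not None:
            pending.result()

layerwise_prefill(num_layers=4)
```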
- KVCache-centric Scheduling
- Prefill Global Scheduling
- Model-based
- Considers instance load, prefix cache hit length, and the distribution of reusable KVCache blocks.
- Cache load balancing
- Heuristic-based
- The conductor forwards the cache’s location together with the request to an alternative instance if the additional prefill time is shorter than the transfer time.
- Computes the input tokens locally if the best remote prefix match length is no larger than the current local reusable prefix multiplied by a threshold (sketched below).
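A sketch of this balancing heuristic with made-up cost constants; the real conductor works from measured and predicted prefill/transfer times rather than these illustrative numbers:

```python
# Illustrative cache-load-balancing heuristic (all costs are assumptions).
def choose_prefill(local_hit: int, best_remote_hit: int, total_tokens: int,
                   threshold: float = 1.5,
                   prefill_cost_per_tok: float = 1.0,
                   transfer_cost_per_tok: float = 0.2):
    # If the best remote prefix match is not sufficiently longer than the
    # local one, just compute the extra tokens locally.
    if best_remote_hit <= local_hit * threshold:
        return "local", (total_tokens - local_hit) * prefill_cost_per_tok
    # Otherwise compare recomputing the gap locally vs. pulling the remote cache.
    recompute = (best_remote_hit - local_hit) * prefill_cost_per_tok
    transfer = best_remote_hit * transfer_cost_per_tok
    if transfer < recompute:
        return "pull_remote_cache", transfer + (total_tokens - best_remote_hit) * prefill_cost_per_tok
    return "local", (total_tokens - local_hit) * prefill_cost_per_tok

print(choose_prefill(local_hit=1000, best_remote_hit=8000, total_tokens=10000))
```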
- Overload-oriented Scheduling
- Decides whether to accept prefill/decode requests based on load.
- Early rejection based on prefill/decode pool status.
- Deciding based on the current status causes fluctuations due to the delay between decision and execution.
- Instead, decide based on predicted future status (sketched below).
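A sketch of prediction-based early rejection, assuming a deliberately simplified load model (the count of requests still in flight at the predicted prefill-finish time); Mooncake's actual predictor is more involved:

```python
# Admit a request based on the *future* decode load at prefill completion,
# not the current load, to avoid the decision/execution-delay oscillation.
from dataclasses import dataclass
from typing import List

@dataclass
class InFlight:
    finish_time: float   # predicted time this request releases capacity

def predicted_decode_load(in_flight: List[InFlight], at_time: float) -> int:
    return sum(1 for r in in_flight if r.finish_time > at_time)

def admit(now: float, prefill_time: float,
          in_flight: List[InFlight], capacity: int) -> bool:
    ready_at = now + prefill_time
    return predicted_decode_load(in_flight, ready_at) < capacity

queue = [InFlight(finish_time=5.0), InFlight(finish_time=12.0)]
print(admit(now=0.0, prefill_time=6.0, in_flight=queue, capacity=2))  # True
```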
- Related Work
- FasterTransformer, TensorRT-LLM, DeepSpeed, and vLLM
SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
- Observation
- Decode is more expensive than prefill due to low GPU utilization
- Prefill can saturate the GPU even with batch size 1 at sequence length 1024.
- Decode cannot saturate the GPU: increasing batch size from 1 to 8 keeps latency almost constant, except for the attention component (back-of-the-envelope sketch after this list).
- Prefill and decode stages run with varying batch compositions, causing micro-batch (pipeline) bubbles.
- Varying number of tokens in consecutive micro-batches.
- Different compute time of prefill and decode stage.
- Compute time differs even among decode batches, since the accumulated context length (KV) varies across requests.
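A back-of-the-envelope sketch of the saturation claim above, using an illustrative 4096x4096 weight matmul: decode moves nearly the same bytes (the weights) for a tiny fraction of the FLOPs, so it sits far below typical GPU compute-to-bandwidth ratios and weight reads dominate latency:

```python
# Arithmetic intensity (FLOPs per byte) of a single weight matmul; shapes and
# fp16 element size are illustrative assumptions.
def arithmetic_intensity(tokens: int, d_in: int = 4096, d_out: int = 4096,
                         bytes_per_el: int = 2) -> float:
    flops = 2 * tokens * d_in * d_out                 # GEMM FLOPs
    bytes_moved = bytes_per_el * (tokens * d_in       # activations in
                                  + d_in * d_out      # weights
                                  + tokens * d_out)   # activations out
    return flops / bytes_moved

print(f"prefill, 1024 tokens: {arithmetic_intensity(1024):.0f} FLOPs/byte")  # ~683
print(f"decode,  batch 8:     {arithmetic_intensity(8):.1f} FLOPs/byte")     # ~8
```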
- Chunked Prefill
- Split the sequence into chunks with crafted attention masks (mask construction sketched below)
- Slightly lower arithmetic intensity per chunk.
- Slight KVCache overhead due to repeated access to earlier chunks’ KV.
- Acceptable for short sequences (~1K), since attention takes a small portion of the total time.
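A sketch of the per-chunk causal mask in NumPy; this is illustrative, and in practice the mask logic is fused into the attention kernel rather than materialized:

```python
# Queries are the current chunk's tokens; keys cover all previously processed
# tokens plus the chunk itself, and the mask stays causal.
import numpy as np

def chunk_mask(chunk_start: int, chunk_len: int) -> np.ndarray:
    total_kv = chunk_start + chunk_len             # prior KV + current chunk
    q_pos = np.arange(chunk_start, chunk_start + chunk_len)[:, None]
    k_pos = np.arange(total_kv)[None, :]
    return k_pos <= q_pos                          # True where attention allowed

# Second chunk of size 4 over a sequence whose first 4 tokens are done:
print(chunk_mask(chunk_start=4, chunk_len=4).astype(int))
```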
- Decode Maximal Batching
- Piggyback decode tokens into a prefill chunk
- Fuse all linear operations across the hybrid batch and compute attention separately (sketched below)
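A shape-level sketch of hybrid-batch construction; the sizes and random "weights" are illustrative:

```python
# Decode tokens are concatenated with a prefill chunk so all linear layers run
# as one GEMM, then the batch is split back for separate attention paths.
import numpy as np

d_model = 64
chunk = np.random.randn(256, d_model)      # one prefill chunk (256 tokens)
decodes = np.random.randn(8, d_model)      # 8 piggybacked decode tokens
W = np.random.randn(d_model, d_model)      # stand-in linear-layer weights

hybrid = np.concatenate([chunk, decodes])  # fused linear ops: one GEMM
h = hybrid @ W                             # (264, d_model)

h_prefill, h_decode = h[:256], h[256:]     # split back for attention:
# h_prefill -> chunked-prefill attention over its own (causal) prefix,
# h_decode  -> per-request decode attention over each request's full KVCache.
print(h_prefill.shape, h_decode.shape)
```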
- Identifying Ideal Chunk Size
- Smaller chunks piggyback more decodes at the expense of prefill efficiency, whereas larger chunks are prefill-efficient but piggyback fewer decodes.
- Fully utilize GPU tiles and avoid wasted computation on padding (sketched below).
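A sketch of the rounding logic, assuming an illustrative tile size of 128 tokens:

```python
# Take the token budget left after the piggybacked decodes and round *down*
# to a multiple of the GPU tile size so no tile is padded.
def chunk_size(token_budget: int, num_decodes: int, tile: int = 128) -> int:
    free = token_budget - num_decodes        # tokens available for the chunk
    return max(tile, (free // tile) * tile)  # avoid padded (wasted) tiles

print(chunk_size(token_budget=512, num_decodes=10))  # 384, not 502
```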