Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
- Problem Definition
- Prefill: compute bound; cost grows superlinearly with context length. Measured by TTFT (time to first token).
- Decode: memory bound; cost grows sublinearly with context length. Measured by TTIT.
- Baseline disaggregated LLM serving architecture
- Reuse KVCache to reduce compute, but fetching remote cache can increase TTFT.
- Maximize the number of tokens per batch to improve FLOPS utilization, but larger batches increase TTIT.
- Innovation
- KVCache-centric disaggregated architecture
- Transfer reusable KVCache to the selected prefill instance.
- Complete prefill in chunks/layers and stream the output KVCache to the decode instance.
- Load the KVCache and add the request to the continuous batching process.
- KVCache-centric conductor
- Predicts future usage of KVCache blocks.
- Swaps and replicates KVCache blocks.
- Manages resource scaling and request rejection (request flow sketched below).
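A minimal sketch of this request flow; the `PrefillInstance`/`DecodeInstance` interfaces are hypothetical stand-ins rather than Mooncake's actual API, and KVCache chunks are represented as plain token lists. The key point is that the decode side starts receiving KVCache before prefill finishes:

```python
from dataclasses import dataclass
from typing import Iterator, List

@dataclass
class Request:
    tokens: List[int]            # raw input token IDs
    prefix_block_ids: List[int]  # KVCache blocks reusable from the pool

class PrefillInstance:
    def run_incremental(self, req: Request, chunk: int = 4) -> Iterator[List[int]]:
        # Prefill only the uncached suffix, chunk by chunk, yielding each
        # chunk's KVCache so it can be streamed to the decode side early.
        uncached = req.tokens[len(req.prefix_block_ids):]
        for i in range(0, len(uncached), chunk):
            yield uncached[i:i + chunk]  # stand-in for a produced KVCache chunk

class DecodeInstance:
    def __init__(self) -> None:
        self.kv: List[int] = []

    def stream_in(self, kv_chunk: List[int]) -> None:
        self.kv.extend(kv_chunk)         # load KVCache as it arrives

    def continuous_batch(self, req: Request) -> str:
        return f"decoding with {len(req.prefix_block_ids) + len(self.kv)} KV entries"

def serve(req: Request) -> str:
    prefill, decode = PrefillInstance(), DecodeInstance()  # conductor-picked pair
    for kv_chunk in prefill.run_incremental(req):          # chunked prefill
        decode.stream_in(kv_chunk)                         # overlapped transfer
    return decode.continuous_batch(req)                    # join decode batching

print(serve(Request(tokens=list(range(10)), prefix_block_ids=[0, 1, 2])))
```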
- Mooncake’s Disaggregated Architecture
- Hardware
- Separates prefill and decode hosts.
- Groups CPU, DRAM, SSD, and RDMA resources to implement the disaggregated KVCache.
- KVCache Reuse
- Prefill hosts receive the raw input, the block IDs of the reusable prefix cache, and the block IDs allocated for the full cache; the conductor balances cache reuse against load.
- Reusable KVCache is transferred from remote CPU memory to local GPU memory.
- Incremental Prefill
- Completes prefill using the prefix cache and stores the newly generated incremental KVCache into CPU memory. If the number of uncached input tokens exceeds a threshold, the prefill is split into chunks and executed in a pipelined manner (chunking decision sketched below).
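A small sketch of the chunking decision; the helper name and threshold value are illustrative assumptions, not from the paper:

```python
# Hypothetical helper: split the uncached suffix into pipelined chunks only
# when it exceeds a threshold (chunk_size here is an illustrative value).
def plan_prefill(num_uncached_tokens: int, chunk_size: int = 8192):
    if num_uncached_tokens <= chunk_size:
        return [(0, num_uncached_tokens)]  # single pass, no pipelining
    return [(start, min(start + chunk_size, num_uncached_tokens))
            for start in range(0, num_uncached_tokens, chunk_size)]

print(plan_prefill(5000))   # [(0, 5000)]
print(plan_prefill(20000))  # [(0, 8192), (8192, 16384), (16384, 20000)]
```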
- KVCache Transfer
- The Messenger component manages KVCache transfer, overlapped with incremental prefill, streaming the KVCache generated by each node to reduce waiting time.
- Decoding
- Decode hosts run continuous batching. The SLO is double-checked, and the request is rejected if it would be violated.
- Prefill Pool
- Multi-Node Prefill
- CPP (Chunked Pipeline Parallelism)
- Requires cross-node communication only at the boundaries of pipeline stages.
- Fits both short and long contexts.
- Layer-wise Prefill
- Prefills the KVCache layer by layer, overlapping transfer with computation (sketched below).
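A sketch of the layer-wise overlap, using a one-worker thread pool as a stand-in for the asynchronous transfer engine; `compute_layer` and `store_kv` are placeholder names:

```python
# Layer-wise prefill sketch: the KVCache of layer i is stored/streamed out
# while layer i+1 is being computed.
from concurrent.futures import ThreadPoolExecutor

def compute_layer(layer: int) -> str:
    return f"kv_layer_{layer}"   # placeholder for the layer's KV output

def store_kv(kv: str) -> None:
    pass                         # placeholder for the CPU/RDMA transfer

def layerwise_prefill(num_layers: int) -> None:
    with ThreadPoolExecutor(max_workers=1) as transfer:
        pending = None
        for layer in range(num_layers):
            kv = compute_layer(layer)                # compute current layer
            if pending is not None:
                pending.result()                     # previous transfer finished?
            pending = transfer.submit(store_kv, kv)  # overlap store w/ next layer
        if pending is not None:
            pending.result()

layerwise_prefill(num_layers=4)
```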
- KVCache-centric Scheduling
- Prefill Global Scheduling
- Model-based
- Considers instance load, prefix cache hit length, and the distribution of reusable KVCache blocks.
- Cache load balancing
- Heuristic-based
- The conductor forwards the cache’s location together with the request to an alternative instance if the additional prefill time is shorter than the transfer time.
- Computes the input tokens locally if the best remote prefix match length is no larger than the current local reusable prefix multiplied by a threshold (sketched below).
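A sketch of this balancing heuristic with made-up cost constants; the real conductor works from measured and predicted prefill/transfer times rather than these illustrative numbers:

```python
# Illustrative cache-load-balancing heuristic (all costs are assumptions).
def choose_prefill(local_hit: int, best_remote_hit: int, total_tokens: int,
                   threshold: float = 1.5,
                   prefill_cost_per_tok: float = 1.0,
                   transfer_cost_per_tok: float = 0.2):
    # If the best remote prefix match is not sufficiently longer than the
    # local one, just compute the extra tokens locally.
    if best_remote_hit <= local_hit * threshold:
        return "local", (total_tokens - local_hit) * prefill_cost_per_tok
    # Otherwise compare recomputing the gap locally vs. pulling the remote cache.
    recompute = (best_remote_hit - local_hit) * prefill_cost_per_tok
    transfer = best_remote_hit * transfer_cost_per_tok
    if transfer < recompute:
        return "pull_remote_cache", transfer + (total_tokens - best_remote_hit) * prefill_cost_per_tok
    return "local", (total_tokens - local_hit) * prefill_cost_per_tok

print(choose_prefill(local_hit=1000, best_remote_hit=8000, total_tokens=10000))
```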
- Overload-oriented Scheduling
- Decides whether to accept prefill/decode requests based on load.
- Early rejection based on prefill/decode pool status.
- Deciding based on the current status causes fluctuations due to the delay between decision and execution.
- Instead, decide based on predicted future status (sketched below).
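A sketch of prediction-based early rejection, assuming a deliberately simplified load model (the count of requests still in flight at the predicted prefill-finish time); Mooncake's actual predictor is more involved:

```python
# Admit a request based on the *future* decode load at prefill completion,
# not the current load, to avoid the decision/execution-delay oscillation.
from dataclasses import dataclass
from typing import List

@dataclass
class InFlight:
    finish_time: float   # predicted time this request releases capacity

def predicted_decode_load(in_flight: List[InFlight], at_time: float) -> int:
    return sum(1 for r in in_flight if r.finish_time > at_time)

def admit(now: float, prefill_time: float,
          in_flight: List[InFlight], capacity: int) -> bool:
    ready_at = now + prefill_time
    return predicted_decode_load(in_flight, ready_at) < capacity

queue = [InFlight(finish_time=5.0), InFlight(finish_time=12.0)]
print(admit(now=0.0, prefill_time=6.0, in_flight=queue, capacity=2))  # True
```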
- Related Work
- FasterTransformer, TensorRT-LLM, DeepSpeed, and vLLM
SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
- Observation
- Decode is more expensive than prefill due to low GPU utilization
- Prefill can saturate the GPU even with batch size 1 at sequence length 1024.
- Decode cannot saturate the GPU: increasing batch size from 1 to 8 keeps latency almost constant, except for the attention component (back-of-the-envelope sketch after this list).
- Prefill and decode stages run with varying batch compositions, causing micro-batch (pipeline) bubbles.
- Varying number of tokens in consecutive micro-batches.
- Different compute time of prefill and decode stage.
- Compute time differs even among decode batches, since the accumulated context length (KV) varies across requests.
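A back-of-the-envelope sketch of the saturation claim above, using an illustrative 4096x4096 weight matmul: decode moves nearly the same bytes (the weights) for a tiny fraction of the FLOPs, so it sits far below typical GPU compute-to-bandwidth ratios and weight reads dominate latency:

```python
# Arithmetic intensity (FLOPs per byte) of a single weight matmul; shapes and
# fp16 element size are illustrative assumptions.
def arithmetic_intensity(tokens: int, d_in: int = 4096, d_out: int = 4096,
                         bytes_per_el: int = 2) -> float:
    flops = 2 * tokens * d_in * d_out                 # GEMM FLOPs
    bytes_moved = bytes_per_el * (tokens * d_in       # activations in
                                  + d_in * d_out      # weights
                                  + tokens * d_out)   # activations out
    return flops / bytes_moved

print(f"prefill, 1024 tokens: {arithmetic_intensity(1024):.0f} FLOPs/byte")  # ~683
print(f"decode,  batch 8:     {arithmetic_intensity(8):.1f} FLOPs/byte")     # ~8
```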
- Chunked Prefill
- Split the sequence into chunks with crafted attention masks (mask construction sketched below)
- Slightly lower arithmetic intensity per chunk.
- Slight KVCache overhead due to repeated access to earlier chunks’ KV.
- Acceptable for short sequences (~1K), since attention takes a small portion of the total time.
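A sketch of the per-chunk causal mask in NumPy; this is illustrative, and in practice the mask logic is fused into the attention kernel rather than materialized:

```python
# Queries are the current chunk's tokens; keys cover all previously processed
# tokens plus the chunk itself, and the mask stays causal.
import numpy as np

def chunk_mask(chunk_start: int, chunk_len: int) -> np.ndarray:
    total_kv = chunk_start + chunk_len             # prior KV + current chunk
    q_pos = np.arange(chunk_start, chunk_start + chunk_len)[:, None]
    k_pos = np.arange(total_kv)[None, :]
    return k_pos <= q_pos                          # True where attention allowed

# Second chunk of size 4 over a sequence whose first 4 tokens are done:
print(chunk_mask(chunk_start=4, chunk_len=4).astype(int))
```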
- Decode Maximal Batching
- Piggyback decode tokens into a prefill chunk
- Fuse all linear operations across the hybrid batch and compute attention separately (sketched below)
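A shape-level sketch of hybrid-batch construction; the sizes and random "weights" are illustrative:

```python
# Decode tokens are concatenated with a prefill chunk so all linear layers run
# as one GEMM, then the batch is split back for separate attention paths.
import numpy as np

d_model = 64
chunk = np.random.randn(256, d_model)      # one prefill chunk (256 tokens)
decodes = np.random.randn(8, d_model)      # 8 piggybacked decode tokens
W = np.random.randn(d_model, d_model)      # stand-in linear-layer weights

hybrid = np.concatenate([chunk, decodes])  # fused linear ops: one GEMM
h = hybrid @ W                             # (264, d_model)

h_prefill, h_decode = h[:256], h[256:]     # split back for attention:
# h_prefill -> chunked-prefill attention over its own (causal) prefix,
# h_decode  -> per-request decode attention over each request's full KVCache.
print(h_prefill.shape, h_decode.shape)
```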
- Identifying Ideal Chunk Size
- Smaller chunks piggyback more decodes at the expense of prefill efficiency, whereas larger chunks are prefill-efficient but piggyback fewer decodes.
- Fully utilize GPU tiles and avoid wasted computation on padding (sketched below).
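A sketch of the rounding logic, assuming an illustrative tile size of 128 tokens:

```python
# Take the token budget left after the piggybacked decodes and round *down*
# to a multiple of the GPU tile size so no tile is padded.
def chunk_size(token_budget: int, num_decodes: int, tile: int = 128) -> int:
    free = token_budget - num_decodes        # tokens available for the chunk
    return max(tile, (free // tile) * tile)  # avoid padded (wasted) tiles

print(chunk_size(token_budget=512, num_decodes=10))  # 384, not 502
```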