Disaggregated Serving

Disaggregated serving is about separating the prefill phase (which generates the first token) from the decoding phase (which generates tokens autoregressively, one at a time) of LLM inference. This blog post from the authors of the original paper gives a great explanation.

0 Motivations

The prefill phase determines TTFT (time to first token) and the decode phase determines TPOT (time per output token, i.e., inter-token latency). Throughput alone is no longer a good measure of LLM serving performance; goodput, the number of requests per second that meet the SLOs (Service-Level Objectives), is adopted instead.
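
As a rough illustration, goodput counts only the requests whose TTFT and TPOT both meet the SLO. The thresholds and numbers below are made-up assumptions for the sketch, not figures from the paper.

```python
# Minimal goodput sketch: only requests meeting BOTH latency SLOs count.
# SLO thresholds and request stats are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class RequestStats:
    ttft_ms: float  # time to first token (set by the prefill phase)
    tpot_ms: float  # time per output token / inter-token latency (decode phase)

TTFT_SLO_MS = 400.0
TPOT_SLO_MS = 50.0

def goodput(requests, window_s):
    """Requests per second that satisfy both the TTFT and the TPOT SLO."""
    good = sum(1 for r in requests
               if r.ttft_ms <= TTFT_SLO_MS and r.tpot_ms <= TPOT_SLO_MS)
    return good / window_s

# 90 of 120 requests met both SLOs in a 10 s window -> goodput = 9 req/s,
# even though raw throughput is 12 req/s.
stats = [RequestStats(300, 40)] * 90 + [RequestStats(800, 70)] * 30
print(goodput(stats, window_s=10.0))  # 9.0
```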

Continuous batching is used to increase GPU utilization and is good for improving throughput.
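
Sketched below is what iteration-level (continuous) batching roughly looks like; `model_step`, the request dictionaries, and the batch-size limit are placeholders, not a real serving API.

```python
# Continuous batching sketch: requests join the running batch between decode
# steps and leave as soon as they finish, instead of waiting for a full batch.
from collections import deque

def model_step(batch):
    # Placeholder for one forward pass that emits one token per running request.
    for req in batch:
        req["generated"] += 1

def run(requests, max_batch_size=32):
    waiting, running = deque(requests), []
    steps = 0
    while waiting or running:
        # Admit new requests whenever there is a free slot in the batch.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        model_step(running)
        steps += 1
        # Retire finished requests immediately so their slots can be reused.
        running[:] = [r for r in running if r["generated"] < r["max_tokens"]]
    return steps

reqs = [{"generated": 0, "max_tokens": 8 + i % 5} for i in range(100)]
print(run(reqs))  # fewer decode steps than static batching needs for the same work
```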

1 Colocation Issues

The two phases have very distinct computational characteristics: prefill is compute-bound, while decoding is memory-bound. Colocating them on the same resources causes

  • Interference: wasted time grows as a steady stream of requests arrives (note that each request is processed at the same rate, whether it is in prefill or decoding), so resources have to be over-provisioned to meet the SLO.
  • Coupled resource allocation and parallelism: the parallelism strategy (tensor, pipeline, or data parallelism) is forced to be the same for the prefill and the decoding computation.

2 Disaggregating Prefill and Decoding

The idea is simple: disaggregate prefill and decoding onto different GPUs and customize the parallelism strategy for each phase.
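
For example, a deployment plan might look like the (entirely made-up) description below, where each phase gets its own GPU group and parallelism setting; this is not the configuration format of any real system.

```python
# Illustrative deployment plan (made-up names, not a real config schema):
# prefill instances lean on intra-op (tensor) parallelism to cut TTFT,
# while decode instances can use pipeline parallelism for throughput.
disaggregated_plan = {
    "prefill":  {"replicas": 2, "gpus_per_replica": 2, "tensor_parallel": 2, "pipeline_parallel": 1},
    "decoding": {"replicas": 1, "gpus_per_replica": 4, "tensor_parallel": 1, "pipeline_parallel": 4},
}
```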

Even though more GPUs are used, the per-GPU throughput is higher.

Disaggregation comes at the cost of transferring intermediate states (i.e., the KV cache) between prefill and decoding GPUs. With NVLink (600 GB/s) or PCIe 5.0 (64 GB/s), the KV cache transfer time is less than the time of a single decoding step!
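
A back-of-the-envelope check: the layer count, hidden size, FP16 element size, and prompt length below are assumptions for a 13B-class model, not numbers from the paper.

```python
# Rough KV-cache transfer estimate; all dimensions are illustrative assumptions.
def kv_cache_bytes(num_tokens, num_layers=40, hidden_size=5120, dtype_bytes=2):
    return 2 * num_layers * hidden_size * dtype_bytes * num_tokens  # 2x for K and V

def transfer_ms(num_bytes, bandwidth_gb_per_s):
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e3

size = kv_cache_bytes(num_tokens=2048)            # ~1.7 GB for a 2048-token prompt
print(f"KV cache: {size / 1e9:.2f} GB")
print(f"NVLink 600 GB/s : {transfer_ms(size, 600):.1f} ms")   # ~2.8 ms
print(f"PCIe 5.0 64 GB/s: {transfer_ms(size, 64):.1f} ms")    # ~26 ms
# A decoding step typically takes tens of milliseconds, so the transfer can be
# hidden within a single step, especially over NVLink.
```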

3 Disaggregation vs Chunked Prefill/Piggybacking

The key idea of SplitFuse (chunked prefill) is to split a lengthy prefill into smaller chunks and combine each chunk with piggybacked decoding tasks, forming batches with better GPU utilization.
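
A simple scheduling sketch of this idea; the per-step token budget and the request fields are assumptions for illustration.

```python
# Chunked-prefill / piggybacking sketch (illustrative, not a real scheduler).
TOKEN_BUDGET = 512  # total tokens processed in one forward pass

def build_batch(prefill_queue, decode_requests):
    """Pack one decode token per running request, then spend the remaining
    budget on a chunk of the next pending prefill."""
    batch = [("decode", r["id"], 1) for r in decode_requests]
    budget = TOKEN_BUDGET - len(decode_requests)
    if prefill_queue and budget > 0:
        req = prefill_queue[0]
        chunk = min(budget, req["remaining_prompt"])
        req["remaining_prompt"] -= chunk
        batch.append(("prefill_chunk", req["id"], chunk))
    return batch

decodes = [{"id": i} for i in range(100)]
prefills = [{"id": "p0", "remaining_prompt": 2000}]
batch = build_batch(prefills, decodes)
print(sum(n for _, _, n in batch))  # 512: the prefill chunk fills the leftover budget
```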

Pipeline parallelism (PP) splits a model layer-wise, with each GPU responsible for a subset of layers, whereas tensor parallelism (TP) shards each layer across the participating GPUs. PP has a much better compute-to-communication ratio and does not require expensive interconnects, but it introduces pipeline bubbles.
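
To see why bubbles matter: under a simple GPipe-style schedule (an assumption here; real serving schedules differ), with p stages and m microbatches roughly a (p-1)/(m+p-1) fraction of the time is idle.

```python
# Idle ("bubble") fraction of a simple GPipe-style pipeline: with p stages and
# m microbatches, the pipeline spends p-1 slots filling and draining.
def bubble_fraction(p, m):
    return (p - 1) / (m + p - 1)

print(f"{bubble_fraction(p=4, m=8):.0%}")   # ~27% idle
print(f"{bubble_fraction(p=4, m=32):.0%}")  # ~9% idle: more microbatches shrink the bubble
```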

Chunked prefill may be promising for better throughput, but the goal here is not to trade off TTFT against TPOT; it is to adhere to both.

For TTFT, chunked prefill adds overhead regardless of the chunk size. For TPOT, colocating prefill chunks with decoding slows down all the piggybacked decoding tasks.
