<aside> 💡

Prefill-decode (PD) disaggregation;

Distributed KV cache;

</aside>

Paper

Paper link

[DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving](<https://arxiv.org/pdf/2401.09670>) (Peking University)

Source code: https://github.com/LLMServe/DistServe
Project page

Throughput is Not All You Need: Maximizing Goodput in LLM Serving using Prefill-Decode Disaggregation

1. Background

1.1 Throughput? Goodput!

LLM serving mainly involves the following two types of SLOs (service-level objectives):

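TTFT: time to first token, i.e. the latency from when a request arrives until its first output token is produced.
TPOT: time per output token, i.e. the average latency to generate each subsequent output token after the first.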
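
As a rough illustration of how these two metrics could be measured, the sketch below computes TTFT and TPOT from per-token timestamps and checks them against SLO targets. All names here (`RequestTrace`, `ttft`, `tpot`, `meets_slo`, `slo_attainment`) and the threshold values are hypothetical and chosen only for illustration; they are not taken from the DistServe codebase.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical per-request log: all times are in seconds."""
    arrival: float            # when the request arrived
    token_times: List[float]  # when each output token was emitted


def ttft(trace: RequestTrace) -> float:
    """Time to first token: first token time minus arrival time."""
    return trace.token_times[0] - trace.arrival


def tpot(trace: RequestTrace) -> float:
    """Average time per output token, excluding the first token."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0  # assumption: a single-token reply has no decode latency
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def meets_slo(trace: RequestTrace, ttft_slo: float, tpot_slo: float) -> bool:
    """A request counts toward goodput only if it meets both SLOs."""
    return ttft(trace) <= ttft_slo and tpot(trace) <= tpot_slo


def slo_attainment(traces: List[RequestTrace],
                   ttft_slo: float, tpot_slo: float) -> float:
    """Fraction of requests that satisfy both SLOs."""
    if not traces:
        return 0.0
    ok = sum(meets_slo(t, ttft_slo, tpot_slo) for t in traces)
    return ok / len(traces)


# Example with made-up timestamps.
fast = RequestTrace(arrival=0.0, token_times=[0.5, 1.0, 1.5, 2.0])
slow = RequestTrace(arrival=0.0, token_times=[4.0, 4.5, 5.0])
print(ttft(fast), tpot(fast))                     # 0.5 0.5
print(slo_attainment([fast, slow], 1.0, 1.0))     # 0.5
```

In this sketch, `slow` has an acceptable TPOT but misses the TTFT target, so it does not count toward goodput even though the system still spent time producing its tokens. That is the distinction between throughput and goodput that the heading alludes to.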