Paper

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

[sarathi-serve](<https://github.com/microsoft/sarathi-serve>) (git repository)

Introduction[1]

Summary: three scheduling strategies[3]

continuous batching [iteration-level scheduling] *

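Continuous batching schedules at iteration granularity: as soon as one request in the batch finishes, a waiting request takes its place instead of waiting for the whole batch to drain. Inference for chat-style models splits into prefill (executed once per request) and decode (autoregressive, executed many times), which allows two batching approaches, as shown in the figure below: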

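The difference between the two in the figure above is that vLLM's batching strategy does not stack prefill work and decode work in parallel, whereas TensorRT-LLM's batching strategy stacks work of the same or different kinds whenever compute resources are idle, whether that work is prefill or decode, to maximize device utilization.

Under continuous batching, the prefill phase processes many tokens at once, so device utilization tends to be high and the phase is likely compute-bound. The decode phase processes few tokens per step, so device utilization tends to be low and the phase is likely IO-bound (limited by memory bandwidth).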
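The two batching approaches can be compared with a toy simulation. This is a minimal sketch, not the actual scheduler of vLLM or TensorRT-LLM: the names `Request`, `simulate`, `max_batch`, and `mixed` are hypothetical, all requests are assumed to be queued at time zero, there is no per-iteration token budget, and `decode_steps` counts only the tokens generated after prefill. Each iteration is recorded as `(prefill_tokens, decode_tokens)`.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: int
    prompt_len: int     # tokens processed once, during prefill
    decode_steps: int   # tokens generated one per iteration after prefill
    generated: int = 0


def simulate(specs, max_batch, mixed):
    """Toy iteration-level scheduler (illustrative assumption, not a real API).

    specs: list of (prompt_len, decode_steps) pairs.
    mixed=False: an iteration that runs prefill does not run decode.
    mixed=True:  prefill and decode share the same iteration.
    Returns one (prefill_tokens, decode_tokens) tuple per iteration.
    """
    waiting = deque(Request(i, p, d) for i, (p, d) in enumerate(specs))
    running = []
    trace = []
    while waiting or running:
        # Iteration-level admission: fill free slots before every step.
        admitted = []
        while waiting and len(running) + len(admitted) < max_batch:
            admitted.append(waiting.popleft())
        prefill_tokens = sum(r.prompt_len for r in admitted)

        if admitted and not mixed:
            decoders = []              # ongoing decodes stall behind the prefill
        else:
            decoders = list(running)   # decodes run, alongside prefill if any
        for r in decoders:
            r.generated += 1           # one new token per decoding request

        trace.append((prefill_tokens, len(decoders)))
        # Finished requests leave at once, freeing slots for the next iteration.
        running = [r for r in running + admitted if r.generated < r.decode_steps]
    return trace


specs = [(8, 4), (4, 1), (6, 1)]
separate_trace = simulate(specs, max_batch=2, mixed=False)
mixed_trace = simulate(specs, max_batch=2, mixed=True)
print("separate:", separate_trace)
print("mixed:   ", mixed_trace)
```

In this example the separate schedule needs one more iteration than the mixed one, because request 0 generates nothing while request 2 is prefilled. The trace also shows the utilization gap described above: prefill iterations carry many tokens, while decode-only iterations carry one token per running request.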
