Key features [2]

PagedAttention [1][11] with memory sharing
Continuous batching of incoming requests [10]
Quantization:
- GPTQ
- AWQ
- SqueezeLLM
- FP8 KV Cache
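
As a rough illustration of the paged KV cache and memory sharing listed above, here is a minimal sketch. It assumes the cache is split into fixed-size blocks, each sequence keeps a block table, and shared blocks are tracked by reference count. `BlockPool`, `BLOCK_SIZE`, and `blocks_needed` are hypothetical names for this toy example, not any library's API.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV block (illustrative value, an assumption)


@dataclass
class BlockPool:
    """Toy pool of fixed-size KV-cache blocks with reference counts."""
    num_blocks: int
    free: list = field(init=False)
    refcount: dict = field(init=False)

    def __post_init__(self):
        self.free = list(range(self.num_blocks))
        self.refcount = {}

    def allocate(self) -> int:
        # Take any free physical block; blocks need not be contiguous.
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        # A second sequence maps the same physical block instead of copying it.
        self.refcount[block] += 1
        return block

    def release(self, block: int) -> None:
        # A block returns to the free list only when no sequence uses it.
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)


def blocks_needed(num_tokens: int) -> int:
    # Ceiling division: a partially filled block still occupies a whole block.
    return -(-num_tokens // BLOCK_SIZE)


pool = BlockPool(num_blocks=8)

# Sequence A holds an 8-token prompt, which needs 2 blocks.
table_a = [pool.allocate() for _ in range(blocks_needed(8))]

# Sequence B has the same prompt, so it shares A's blocks.
table_b = [pool.share(b) for b in table_a]

# B then generates a new token and gets one private block for it.
table_b.append(pool.allocate())
```

Only 3 physical blocks are in use for two sequences that would otherwise need 5. A real engine would also handle copy-on-write when a shared block is modified, which this sketch omits.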

Features [12]

Kernel optimization

CUDA Graph

Multi-step Scheduling

Chunked prefill
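
To make chunked prefill concrete, here is a minimal sketch of one common formulation. It assumes each scheduling step has a fixed token budget, that running decode requests each take one token first, and that the leftover budget goes to a chunk of a pending prompt. `schedule_step`, `plan_prefill`, and `token_budget` are hypothetical names, and this is not any specific engine's scheduler.

```python
def schedule_step(num_decodes: int, prefill_remaining: int, token_budget: int):
    """Split one step's token budget between decodes and a prefill chunk."""
    # Each running decode request consumes one token of the budget.
    decode_count = min(num_decodes, token_budget)
    leftover = token_budget - decode_count
    # The pending prompt gets whatever budget is left, up to what it still needs.
    chunk = min(prefill_remaining, leftover)
    return decode_count, chunk


def plan_prefill(prompt_len: int, num_decodes: int, token_budget: int):
    """Return the prefill chunk sizes used on successive steps."""
    chunks = []
    remaining = prompt_len
    while remaining > 0:
        _, chunk = schedule_step(num_decodes, remaining, token_budget)
        if chunk == 0:
            raise ValueError("token budget too small to make prefill progress")
        chunks.append(chunk)
        remaining -= chunk
    return chunks


# A 10-token prompt, 2 concurrent decodes, 6 tokens per step.
plan = plan_prefill(prompt_len=10, num_decodes=2, token_budget=6)
print(plan)
```

With these numbers the prompt is processed over three steps while the two decode requests keep advancing on every step, instead of stalling behind one long prefill.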