Key features [2]
PagedAttention [1][11] with memory sharing
Continuous batching of incoming requests [10]
Quantization:
GPTQ
AWQ
SqueezeLLM
FP8 KV Cache
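A toy allocator can illustrate the block-table idea behind paged attention and its memory sharing. This is a minimal sketch, not vLLM's implementation. `BlockAllocator`, `Sequence`, `fork` and `BLOCK_SIZE` are hypothetical names, and the 4-token block size is chosen only to keep the example small. Each sequence maps its logical blocks to fixed-size physical blocks. A forked sequence (for example, a second sample from the same prompt) reuses the parent's physical blocks through reference counts. Before a shared, partially filled last block is written, it is swapped for a private block (copy-on-write).

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV-cache block (hypothetical, kept tiny for the demo)


class BlockAllocator:
    """Toy pool of fixed-size physical KV-cache blocks with reference counts."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks = list(range(num_blocks))
        self.ref_count = [0] * num_blocks

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("out of KV-cache blocks")
        block = self.free_blocks.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block: int) -> int:
        # Another sequence maps the same physical block; no KV data is copied.
        self.ref_count[block] += 1
        return block

    def release(self, block: int) -> None:
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            self.free_blocks.append(block)


@dataclass
class Sequence:
    allocator: BlockAllocator
    num_tokens: int = 0
    block_table: list[int] = field(default_factory=list)  # logical -> physical

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:
            # Last block is full (or none exists yet): map a fresh block.
            self.block_table.append(self.allocator.allocate())
        else:
            last = self.block_table[-1]
            if self.allocator.ref_count[last] > 1:
                # Copy-on-write: the half-full last block is shared, so take a
                # private block before writing (copying its contents is omitted).
                self.allocator.release(last)
                self.block_table[-1] = self.allocator.allocate()
        self.num_tokens += 1

    def fork(self) -> "Sequence":
        # The child shares every physical block of the parent (e.g. the prompt).
        shared = [self.allocator.share(b) for b in self.block_table]
        return Sequence(self.allocator, self.num_tokens, shared)

    def free(self) -> None:
        for b in self.block_table:
            self.allocator.release(b)
        self.block_table.clear()


alloc = BlockAllocator(num_blocks=8)
parent = Sequence(alloc)
for _ in range(6):     # 6 tokens: one full block plus a half-full second block
    parent.append_token()
child = parent.fork()  # child reuses both physical blocks
child.append_token()   # copy-on-write on the half-full shared block
```

Only the partially filled block is duplicated. The full prefix block stays shared until both sequences release it.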
Features [12]
Kernel optimization
CUDA Graph
Multi-step Scheduling
Chunked prefill
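Continuous batching and chunked prefill both come down to what the scheduler places into each model step. The sketch below is a simplified illustration under stated assumptions, not vLLM's scheduler. `Request`, `schedule_step`, `apply_step` and the fixed `token_budget` are hypothetical. The sketch also assumes that decode tokens are scheduled before prompt chunks and that the step finishing a prompt emits the first output token. New requests join the running batch whenever budget remains, and finished ones drop out after every step (continuous batching). A prompt longer than the remaining budget is split into chunks across steps (chunked prefill).

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    prompt_len: int
    max_new_tokens: int
    prefilled: int = 0  # prompt tokens already processed
    generated: int = 0  # output tokens produced so far

    @property
    def in_prefill(self) -> bool:
        return self.prefilled < self.prompt_len

    @property
    def done(self) -> bool:
        return self.generated >= self.max_new_tokens


def schedule_step(running: list, waiting: deque, token_budget: int) -> list:
    """Return (request, num_tokens) pairs for one step under a token budget."""
    batch, budget = [], token_budget
    # 1. Decodes first: one token per request that has finished its prompt.
    for req in running:
        if not req.in_prefill and budget > 0:
            batch.append((req, 1))
            budget -= 1
    # 2. Next chunk of prompts that were only partially prefilled.
    for req in running:
        if req.in_prefill and budget > 0:
            chunk = min(req.prompt_len - req.prefilled, budget)
            batch.append((req, chunk))
            budget -= chunk
    # 3. Admit waiting requests into the running batch while budget remains.
    while waiting and budget > 0:
        req = waiting.popleft()
        running.append(req)
        chunk = min(req.prompt_len, budget)
        batch.append((req, chunk))
        budget -= chunk
    return batch


def apply_step(batch: list, running: list) -> None:
    """Pretend the model ran the batch, then drop finished requests."""
    for req, n in batch:
        if req.in_prefill:
            req.prefilled += n
            if not req.in_prefill:
                req.generated += 1  # assumed: finishing the prompt yields token 1
        else:
            req.generated += 1
    running[:] = [r for r in running if not r.done]


running: list = []
waiting = deque([Request("A", prompt_len=10, max_new_tokens=3),
                 Request("B", prompt_len=4, max_new_tokens=2)])
trace = []
while running or waiting:
    batch = schedule_step(running, waiting, token_budget=8)
    trace.append([(r.rid, n) for r, n in batch])
    apply_step(batch, running)
```

With a budget of 8, the 10-token prompt of `A` is split into chunks of 8 and 2. `B` is admitted in the second step alongside `A`'s remaining chunk rather than waiting for `A` to finish.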