TensorRT-LLM

Key features [2]

Flash Attention
MHA/MQA/GQA
Quantization
- Weight-Only
- SmoothQuant
- GPTQ
- AWQ
- FP8
Paged KV Cache for Attention
Multi-GPU Multi-Node
TP (Tensor Parallelism) / PP (Pipeline Parallelism)
In-flight Batching
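
The MHA/MQA/GQA entry in the feature list refers to multi-head, multi-query, and grouped-query attention, which differ only in how many key/value heads the query heads share. The following is a minimal sketch of that sharing pattern in plain Python. It does not use or mirror any TensorRT-LLM API; the function names `kv_head_for_query_head` and `head_mapping` are hypothetical, and it assumes the common convention that consecutive query heads form a group.

```python
def kv_head_for_query_head(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    """Return the index of the K/V head that a given query head reads from.

    Hypothetical helper for illustration only; assumes consecutive query
    heads are grouped together.
    """
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be divisible by num_kv_heads")
    group_size = num_q_heads // num_kv_heads  # query heads per K/V head
    return q_head // group_size


def head_mapping(num_q_heads: int, num_kv_heads: int) -> list[int]:
    """List the K/V head index used by each query head, in order."""
    return [
        kv_head_for_query_head(h, num_q_heads, num_kv_heads)
        for h in range(num_q_heads)
    ]


if __name__ == "__main__":
    print("MHA:", head_mapping(8, 8))  # one K/V head per query head
    print("MQA:", head_mapping(8, 1))  # all query heads share one K/V head
    print("GQA:", head_mapping(8, 2))  # query heads share K/V heads in groups
```

With 8 query heads, MHA keeps 8 K/V heads, MQA keeps 1, and GQA with 2 groups keeps 2, so the KV cache shrinks in proportion to the number of K/V heads.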

TensorRT-LLM Inference Deployment

[Docker-based deployment]

References

Principles

TensorRT-LLM git

1xx. TensorRT-LLM Step-by-Step Tutorial (Part 1): Quick Start

Hands-on Practice