TensorRT-LLM
Key features [2]
- Flash Attention
- MHA/MQA/GQA
- Quantization
  - Weight-Only
  - SmoothQuant
  - GPTQ
  - AWQ
  - FP8
- Paged KV Cache for the Attention
- Multi-GPU Multi-Node
  - Tensor Parallelism (TP) / Pipeline Parallelism (PP)
- In-flight Batching
TensorRT-LLM Inference Deployment
[Docker-based deployment]
References
Principles
- TensorRT-LLM Git repository
1xx. TensorRT-LLM Step-by-Step Tutorial (Part 1): Quick Start
Hands-on