Paper

Paper link: A Survey on Efficient Inference for Large Language Models
```
Tsinghua University    Jul 2024
```

Abstract

Large Language Models (LLMs) have attracted extensive attention due to their remarkable performance across various tasks. However, the substantial computational and memory requirements of LLM inference pose challenges for deployment in resource-constrained scenarios. Efforts within the field have been directed towards developing techniques aimed at enhancing the efficiency of LLM inference. This paper presents a comprehensive survey of the existing literature on efficient LLM inference. We start by analyzing the primary causes of inefficient LLM inference, i.e., the large model size, the quadratic-complexity attention operation, and the auto-regressive decoding approach. Then, we introduce a comprehensive taxonomy that organizes the current literature into data-level, model-level, and system-level optimization. Moreover, the paper includes comparative experiments on representative methods within critical sub-fields to provide quantitative insights. Last but not least, we provide some knowledge summary and discuss future research directions.

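In short: LLMs have drawn wide attention for their strong performance across many tasks, but the heavy compute and memory demands of inference make them hard to deploy in resource-constrained settings, and the community has been working on techniques to improve inference efficiency. The paper surveys this literature. It first analyzes the main causes of inefficient inference: the large parameter count, the quadratic complexity of the attention operation, and auto-regressive decoding. It then introduces a taxonomy that divides existing work into data-level, model-level, and system-level optimization. It also runs comparative experiments on representative methods in key sub-fields and draws quantitative insights from them. Finally, it summarizes the findings and discusses future research directions.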

1 INTRODUCTION

Large Language Models (LLMs) have garnered substantial attention from both academia and industry in recent years. The field of LLMs has experienced notable growth and significant achievements. Numerous open-source LLMs have emerged, including the GPT-series (GPT-1 [1], GPT-2 [2], and GPT-3 [3]), OPT [4], LLaMA-series (LLaMA [5], LLaMA 2 [5], Baichuan 2 [6], Vicuna [7], LongChat [8]), BLOOM [9], FALCON [10], GLM [11], and Mistral [12], which are used for both academic research and commercial purposes. The success of LLMs stems from their robust capability in handling diverse tasks such as natural language understanding (NLU), natural language generation (NLG), reasoning [13], [14], and code generation [15], consequently enabling impactful applications like ChatGPT, Copilot, and Bing. There is a growing belief [16] that the rise and achievements of LLMs signify a significant stride towards Artificial General Intelligence (AGI) for humanity.

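In short: the LLM field has seen rapid growth and notable achievements. Many open-source LLMs have emerged, including the GPT series (GPT-1, GPT-2, and GPT-3), OPT, the LLaMA series (LLaMA, LLaMA 2, Baichuan 2, Vicuna, LongChat), BLOOM, FALCON, GLM, and Mistral [12], and they are used in both academic research and commercial products. The success of LLMs comes from their strong ability to handle diverse tasks, such as natural language understanding (NLU), natural language generation (NLG), reasoning, and code generation [15], which has enabled influential applications like ChatGPT, Copilot, and Bing. A growing number of people believe [16] that the rise and achievements of LLMs mark a major step for humanity towards Artificial General Intelligence (AGI).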

However, the deployment of LLMs does not always go smoothly. As shown in Fig. 1, LLMs typically demand higher computational cost, memory access cost and memory usage in their inference process (we will analyse the root causes in Sec. 2.3), which deteriorates the efficiency indicators (e.g., latency, throughput, power consumption and storage) in resource-constrained scenarios. This poses challenges for the application of LLMs in both edge and cloud scenarios. For example, the immense storage requirements render the deployment of a 70-billion-parameter model impractical on personal laptops for tasks such as development assistance. Additionally, the low throughput would result in significant costs if LLMs were used for every search engine request, leading to a considerable reduction in the profits of the search engine.

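In short: deploying LLMs does not always go smoothly. As Fig. 1 shows, LLM inference typically incurs higher compute cost, memory-access cost, and memory usage (see Sec. 2.3 for the root causes), which degrades efficiency metrics such as latency, throughput, power consumption, and storage in resource-constrained scenarios. This is a challenge on both edge devices and in the cloud. For example, the storage footprint makes it impractical to run a 70B-parameter model on a personal laptop as a development assistant, and serving every search-engine request with an LLM at low throughput would be costly enough to cut substantially into the search engine's profits.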
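The 70-billion-parameter example can be made concrete with a back-of-the-envelope calculation. The sketch below is not from the survey: the helper name `weight_memory_gb` is hypothetical, and it counts model weights only, ignoring the KV cache, activations, and runtime overhead, so real memory usage would be higher.

```python
# Bytes per parameter for common weight formats (standard element sizes).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory needed to hold the weights alone, in GB (10^9 bytes).

    Hypothetical helper for illustration; ignores KV cache and activations.
    """
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # Weight footprint of a 70B-parameter model in each format.
    for dtype in BYTES_PER_PARAM:
        print(f"70B weights in {dtype}: {weight_memory_gb(70e9, dtype):.0f} GB")
```

Under these assumptions the FP16 weights alone come to about 140 GB, well beyond the memory of a typical laptop, which is why techniques such as quantization (fewer bytes per parameter) matter for edge deployment.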

Fortunately, a substantial array of techniques has been proposed to enable efficient inference for LLMs. To gain a comprehensive understanding of existing studies and inspire further research, this survey employs a hierarchical classification and systematic summarization of the current landscape of efficient LLM inference. Specifically, we categorize relevant studies into three levels: data-level optimization, model-level optimization, and system-level optimization (refer to Sec. 3 for elaboration). Moreover, we conduct experimental analyses on representative methods within critical sub-fields to consolidate knowledge, offer practical recommendations, and provide guidance for future research endeavors.

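In short: fortunately, many techniques have been proposed to make LLM inference efficient. To give a comprehensive view of existing studies and to inspire further research, the survey classifies and summarizes current work on efficient LLM inference hierarchically, organizing it into data-level, model-level, and system-level optimization. It also runs experimental analyses on representative methods in key sub-fields to consolidate knowledge, offer practical recommendations, and guide future research.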

Currently, several surveys [17], [18], [19], [20], [21], [22], [23] have been conducted in the field of efficient LLMs. These surveys primarily focus on different aspects of LLM efficiency but offer opportunities for further improvement. Zhu et al. [17], Park et al. [18], Wang et al. [19] and Tang et al. [20] concentrate on model compression techniques within model-level optimization. Ding et al. [21] center on efficiency research considering both data and model architecture perspectives. Miao et al. [22] approach efficient LLM inference from a machine learning system (MLSys) research perspective. In contrast, our survey provides a more comprehensive research scope, addressing optimization at three levels: data-level, model-level, and system-level, with the inclusion of recent advancements. While Wan et al. [23] and Xu et al. [24] also deliver comprehensive reviews of efficient LLM research, our work extends them by incorporating comparative experiments and offering practical insights and recommendations based on experimental analyses in several critical sub-fields like model quantization and serving systems. A comparison of these surveys is summarized in Table 1.

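In short: several existing surveys [17], [18], [19], [20], [21], [22], [23] cover efficient LLMs. Each focuses on a different aspect of LLM efficiency and leaves room for improvement. Zhu et al. [17], Park et al. [18], Wang et al. [19], and Tang et al. [20] concentrate on model compression, which belongs to model-level optimization. Ding et al. [21] focus on the data and model-architecture perspectives. Miao et al. [22] study efficient LLM inference from a machine learning systems (MLSys) perspective. By comparison, this survey has a broader scope: it covers optimization at the data, model, and system levels and includes recent work. Wan et al. [23] and Xu et al. [24] also review efficient LLM research comprehensively, but this survey goes further by adding comparative experiments and by offering practical insights and recommendations based on experimental analyses in key sub-fields such as model quantization and serving systems. Table 1 compares these surveys.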