Paper

Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems

Tianqi Chen, CMU, Dec 2023

2 BACKGROUND

2.1 Transformer-based LLM

Transformer-based Large Language Models (LLMs) have marked a significant shift in natural language processing, introducing a new paradigm for understanding and generating human language. Central to this innovation is the Transformer architecture, which is built on self-attention [253], a mechanism that lets the model weigh the importance of different parts of the input when making predictions. Mathematically, self-attention can be described as follows: for an input sequence \( X = [x_1, x_2, \ldots, x_n] \), the Transformer computes a set of queries \( Q \), keys \( K \), and values \( V \) using linear transformations of \( X \). The self-attention scores are then computed as:

\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{T}}{\sqrt{d_k}} \right) V \]

where \( d_k \) is the dimension of the keys. This mechanism allows the model to focus on different parts of the input sequence for each element of the output, capturing complex dependencies regardless of their distance in the input sequence.
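To make the attention computation concrete, here is a minimal single-head sketch in plain Python. It assumes \( Q \), \( K \), and \( V \) have already been produced by the linear projections and are passed in as nested lists; the helper names (`softmax`, `matmul`, `transpose`, `attention`) are illustrative and not taken from the paper, and masking and batching are left out. A real system would use a tensor library rather than lists.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(q, k, v):
    """Compute softmax(Q K^T / sqrt(d_k)) V for one head, without masking.

    Returns the output rows and the attention weights (one row per query).
    """
    d_k = len(k[0])  # dimension of the keys
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Two queries attending over two identical keys: the weights are uniform,
# so each output row is the average of the value rows.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [1.0, 0.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out, w = attention(q, k, v)
print(w)
print(out)
```

Each row of the weights sums to 1, which is what lets every output element be read as a weighted average over all positions of the input, regardless of distance.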

Another important structure in Transformers is the Feed-Forward Network (FFN), which is present in each layer of the Transformer and contributes significantly to its computational intensity. The FFN typically consists of two linear transformations with a non-linear activation function in between, usually represented as:

\[ \mathrm{FFN}(x) = \max(0, x W_1 + b_1) W_2 + b_2 \]

Here, \( W_1 \), \( W_2 \), \( b_1 \), and \( b_2 \) are learnable parameters of the FFN, and the non-linear function \( \max(0, \cdot) \) (ReLU, in this case) introduces the non-linearity the model needs to learn more complex patterns. The FFN accounts for a significant portion of the model’s parameter count and, consequently, of its memory footprint and computational load. In each Transformer layer, after multi-head attention (MHA) aggregates information from different parts of the input, the FFN processes this aggregated information independently for each position. This per-position independence is a key strength of the Transformer, because all positions can be processed in parallel. However, it also means that the computational load and memory requirements scale with the length of the input sequence and the depth of the network.
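The position-wise behaviour described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the paper's implementation: weights are nested lists with \( W_1 \) of shape (d_model x d_ff) and \( W_2 \) of shape (d_ff x d_model), and the function names `ffn` and `ffn_sequence` are hypothetical.

```python
def ffn(x, w1, b1, w2, b2):
    """Apply FFN(x) = max(0, x W1 + b1) W2 + b2 to one position (row vector x)."""
    # First linear transformation followed by ReLU.
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*w1), b1)]
    # Second linear transformation back to the model dimension.
    return [sum(h * w for h, w in zip(hidden, col)) + b
            for col, b in zip(zip(*w2), b2)]

def ffn_sequence(xs, w1, b1, w2, b2):
    """Apply the same FFN to every position independently."""
    return [ffn(x, w1, b1, w2, b2) for x in xs]

# Toy weights: d_model = 2, d_ff = 2, output dimension 1 (for readability).
w1 = [[1.0, -1.0], [0.0, 1.0]]
b1 = [0.0, 0.0]
w2 = [[1.0], [2.0]]
b2 = [0.5]
xs = [[1.0, 0.0], [0.0, 1.0]]
print(ffn_sequence(xs, w1, b1, w2, b2))
```

Because `ffn_sequence` is a plain map over positions, the per-position calls share no state and could run in parallel, while the total work grows linearly with the number of positions and is repeated in every layer.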

The combination of self-attention and FFN in Transformer-based LLMs enables these models to capture a wide range of linguistic contexts and nuances, setting new benchmarks in various NLP tasks. However, the substantial computational requirements of training and inference have made efficiency a critical area of research, with work focusing on optimizing these aspects without significantly compromising model performance. The Transformer model also includes other key components, such as position encoding, which adds information about the position of each token in the sequence, and the multi-head attention mechanism, which allows the model to focus on different parts of the sequence in different representational spaces.
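The two remaining components can be illustrated with a short plain-Python sketch. The text does not say which position encoding is meant, so this assumes the fixed sinusoidal scheme from the original Transformer, which is only one of several options (learned and rotary encodings are common alternatives). `split_heads` is a hypothetical helper showing how an already-projected \( Q \), \( K \), or \( V \) matrix might be divided into per-head subspaces; it is not an API from any particular library.

```python
import math

def sinusoidal_position_encoding(seq_len, d_model):
    """Return a (seq_len x d_model) table of sinusoidal position encodings.

    Even dimensions use sine and odd dimensions use cosine, with each
    sin/cos pair sharing one wavelength.
    """
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def split_heads(x, num_heads):
    """Slice each d_model-wide row into num_heads contiguous chunks.

    Returns one (seq_len x d_head) matrix per head, so attention can be
    computed separately in each representational subspace.
    """
    d_model = len(x[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in x]
            for h in range(num_heads)]

pe = sinusoidal_position_encoding(seq_len=4, d_model=6)
heads = split_heads([[1, 2, 3, 4], [5, 6, 7, 8]], num_heads=2)
print(pe[0])
print(heads)
```

In the original Transformer the encoding table is added element-wise to the token embeddings before the first layer, and the per-head outputs are concatenated and projected once more after attention; both steps are omitted here for brevity.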

2.2 GPUs and Other Accelerators

2.3 LLM Inference