Paper

A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms

Beihang University 2409

1. Introduction

1728116818721.png

2. Basics of Low-bit LLMs

2.1. Low-bit Number Formats

2.1.1. Standard Formats

1729570756080.png

2.2. Quantization Granularity

1729570823232(1).png

Quantization granularity describes how a weight or activation tensor is partitioned so that each partition shares one element of the scaling factor and zero point. It determines how finely the scale rescales values and how finely the zero point shifts them. Figure 2 shows five fundamental types of quantization granularity: tensor-wise, token-wise, channel-wise, group-wise, and element-wise.

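To make the five granularities concrete, here is a minimal sketch in plain Python. It is not taken from the survey: the helper names `partitions` and `fake_quantize` are hypothetical, and it assumes a 2-D tensor whose rows are tokens and whose columns are channels, with group-wise meaning contiguous groups of `group_size` elements inside each row. Each partition gets its own scale and zero point from a simple min/max asymmetric scheme, which is only one of several possible choices.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Index = Tuple[int, int]


def partitions(rows: int, cols: int, granularity: str,
               group_size: int = 2) -> List[List[Index]]:
    """Return the index sets that share one (scale, zero point) pair.

    Assumption: rows are tokens and columns are channels.
    """
    if granularity == "tensor":    # one pair for the whole tensor
        return [[(r, c) for r in range(rows) for c in range(cols)]]
    if granularity == "token":     # one pair per row
        return [[(r, c) for c in range(cols)] for r in range(rows)]
    if granularity == "channel":   # one pair per column
        return [[(r, c) for r in range(rows)] for c in range(cols)]
    if granularity == "group":     # one pair per contiguous group in a row
        return [[(r, c) for c in range(s, min(s + group_size, cols))]
                for r in range(rows) for s in range(0, cols, group_size)]
    if granularity == "element":   # one pair per element
        return [[(r, c)] for r in range(rows) for c in range(cols)]
    raise ValueError(f"unknown granularity: {granularity}")


def fake_quantize(x: Matrix, granularity: str, bits: int = 4,
                  group_size: int = 2) -> Matrix:
    """Quantize then dequantize x with one scale/zero point per partition."""
    rows, cols = len(x), len(x[0])
    qmax = 2 ** bits - 1
    out = [[0.0] * cols for _ in range(rows)]
    for part in partitions(rows, cols, granularity, group_size):
        vals = [x[r][c] for r, c in part]
        # Extend the range to include 0 so that 0.0 is exactly representable.
        lo, hi = min(min(vals), 0.0), max(max(vals), 0.0)
        scale = (hi - lo) / qmax
        if scale == 0.0:           # all-zero partition: any scale works
            scale = 1.0
        zero_point = round(-lo / scale)
        for r, c in part:
            q = round(x[r][c] / scale) + zero_point
            q = min(max(q, 0), qmax)             # clamp to the integer grid
            out[r][c] = (q - zero_point) * scale  # dequantize
    return out


def mean_abs_error(x: Matrix, y: Matrix) -> float:
    """Average absolute reconstruction error between two matrices."""
    n = len(x) * len(x[0])
    return sum(abs(a - b) for ra, rb in zip(x, y) for a, b in zip(ra, rb)) / n


if __name__ == "__main__":
    x = [[0.1, -0.4, 2.0, 0.03],
         [5.0, -3.0, 0.2, 0.0]]
    for g in ("tensor", "token", "channel", "group", "element"):
        n_pairs = len(partitions(len(x), len(x[0]), g))
        err = mean_abs_error(x, fake_quantize(x, g))
        print(f"{g:8s} scale/zero-point pairs: {n_pairs}  mean abs error: {err:.4f}")
```

The number of (scale, zero point) pairs grows as the partitions get smaller, which is the storage cost of finer granularity. In this sketch the element-wise case reconstructs every value up to floating-point error, since each value defines its own range.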