Quantization has established itself as the primary approach for decreasing the computational and storage expenses associated with Large Language Model (LLM) inference. The majority of current research emphasizes quantizing weights and activations to enable low-bit general-matrix-multiply (GEMM) operations, with the remaining non-linear operations executed at higher precision. In our study, we discovered that once these techniques are applied, the primary bottleneck in LLM inference lies in the softmax layer. The softmax operation comprises three phases: exponent calculation, accumulation, and normalization. Our work focuses on optimizing the first two phases. We propose an analytical approach to determine the optimal clipping value for the input to the softmax function, enabling sub-4-bit quantization for LLM inference. This method accelerates the calculations of both $e^x$ and $\sum(e^x)$ with minimal to no accuracy degradation. For example, in LLaMA1-30B, we achieve baseline performance with 2-bit quantization on the well-known "Physical Interaction: Question Answering" (PIQA) dataset evaluation. This ultra-low-bit quantization allows, for the first time, an acceleration of approximately 4x in the accumulation phase. The combination of accelerating both $e^x$ and $\sum(e^x)$ results in a 36.9% acceleration in the softmax operation.
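To illustrate the idea, below is a minimal NumPy sketch of a softmax whose input is clipped and quantized to a few bits, so that $e^x$ reduces to a small lookup table and $\sum(e^x)$ reduces to a histogram-weighted dot product. The function name `quantized_softmax`, the placeholder clipping value `clip_val=-10.0`, and the 1-D input assumption are illustrative choices, not the paper's analytically derived values or actual kernel implementation.

```python
import numpy as np

def quantized_softmax(x, n_bits=2, clip_val=-10.0):
    """Sketch: softmax with a low-bit quantized exponent phase.

    After the usual max-subtraction, x <= 0 is clipped to [clip_val, 0]
    and quantized to 2**n_bits levels, so e^x takes only 2**n_bits
    distinct values. Exponentiation becomes a table lookup, and
    accumulation reduces to counting code occurrences. Assumes 1-D x
    (a single softmax row); clip_val is a stand-in for the paper's
    analytically chosen clipping value.
    """
    x = x - x.max()                 # standard numerical shift: x <= 0
    x = np.clip(x, clip_val, 0.0)   # clip the softmax input

    levels = 2 ** n_bits
    step = -clip_val / (levels - 1)
    q = np.round((x - clip_val) / step).astype(np.int64)  # codes in [0, levels)

    # Exponent phase: one exp per *level*, not per element.
    lut = np.exp(clip_val + step * np.arange(levels))

    # Accumulation phase: histogram of codes, then a tiny dot product,
    # instead of summing one floating-point exponent per element.
    counts = np.bincount(q, minlength=levels)
    denom = counts @ lut

    # Normalization phase kept at higher precision.
    return lut[q] / denom
```

With 2-bit codes, the accumulation loop touches only four table entries regardless of sequence length, which is the source of the roughly 4x speedup in that phase claimed above.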