We consider the problem of accurate quantization for language models, where both the weights and activations are uniformly quantized to 4 bits per parameter, the lowest bitwidth format natively supported by GPU hardware. In this context, the key challenge is activation quantization: it is known that language models contain outlier channels whose values are on average orders of magnitude higher than those of other channels, which prevents accurate low-bitwidth quantization with known techniques. We systematically study this phenomenon and find that these outlier channels emerge early in training, and that they occur more frequently in layers with residual streams. We then propose a simple strategy which regularizes a layer's inputs via quantization-aware training (QAT) and its outputs via activation kurtosis regularization. We show that regularizing both the inputs and outputs is crucial for preventing the model from "migrating" the difficulty of input quantization to the weights, which would make post-training quantization (PTQ) of the weights more difficult. When combined with weight PTQ, we show that our approach can obtain a W4A4 model that performs competitively with the standard-precision W16A16 baseline.
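To make the two regularization targets concrete, the following is a minimal numpy sketch, not the paper's implementation: a per-tensor kurtosis penalty (outlier-heavy activations have kurtosis far above the Gaussian value of 3) and a symmetric uniform fake-quantizer of the kind used in QAT. The function names, the kurtosis target of 3.0, and the per-tensor (rather than per-channel) granularity are illustrative assumptions.

```python
import numpy as np

def kurtosis(x):
    # Sample kurtosis E[(x - mu)^4] / sigma^4: ~3 for Gaussian data,
    # much larger when a few outlier channels dominate the tensor.
    mu = x.mean()
    sigma = x.std()
    return ((x - mu) ** 4).mean() / sigma ** 4

def kurtosis_penalty(x, target=3.0):
    # Hypothetical output regularizer: push activation kurtosis toward
    # the Gaussian value so the distribution quantizes more evenly.
    return (kurtosis(x) - target) ** 2

def fake_quant(x, bits=4):
    # Symmetric uniform fake-quantization (quantize then dequantize).
    # In QAT, the non-differentiable round() would be paired with a
    # straight-through estimator so gradients flow to the inputs.
    qmax = 2 ** (bits - 1) - 1          # 7 for 4-bit signed
    scale = np.abs(x).max() / qmax      # per-tensor scale
    return np.round(x / scale).clip(-qmax - 1, qmax) * scale
```

During training, the QAT loss on a layer's fake-quantized inputs and the kurtosis penalty on its outputs would be added to the usual language-modeling objective; the rounding error of `fake_quant` is bounded by half a quantization step, which is what keeps 4-bit activations tolerable once the outlier channels are suppressed.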