Large language models with high parameter counts are computationally expensive, yet they can be made far more efficient by compressing their weights to very low numerical precision. This can be achieved either through post-training quantization, which minimizes local, layer-wise quantization errors, or through quantization-aware fine-tuning, which minimizes the global loss function. In this study, we found that, under the same data constraint, the former approach nearly always fared worse than the latter, a phenomenon particularly pronounced when the numerical precision is very low. We further showed that this difficulty of post-training quantization arises from a stark misalignment between optimization of the local and global objective functions. Our findings explain the limited utility of minimizing local quantization error and the importance of direct quantization-aware fine-tuning in the regime of large models at very low precision.
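The contrast between the two objectives can be made concrete with a minimal NumPy sketch, assuming a toy two-layer linear network and simple symmetric round-to-nearest quantization (all function names and shapes here are illustrative, not the method used in the study). The local objective sums per-layer reconstruction errors measured on each layer's full-precision inputs, while the global objective measures the end-to-end output error, where later layers see the quantized earlier layers' activations:

```python
import numpy as np

def quantize(w, n_bits=2):
    # Symmetric uniform round-to-nearest quantization (illustrative).
    levels = 2 ** (n_bits - 1) - 1  # e.g. 1 level magnitude for 2-bit symmetric
    scale = np.abs(w).max() / levels
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 8))   # calibration inputs
W1 = rng.standard_normal((8, 8))   # layer 1 weights
W2 = rng.standard_normal((8, 4))   # layer 2 weights

Q1, Q2 = quantize(W1), quantize(W2)

# Local, layer-wise objective: each layer's output reconstruction error,
# evaluated on that layer's own full-precision inputs.
H = X @ W1
local_err = (np.linalg.norm(X @ W1 - X @ Q1) ** 2
             + np.linalg.norm(H @ W2 - H @ Q2) ** 2)

# Global objective: end-to-end output error, where the second layer
# receives the quantized first layer's activations.
global_err = np.linalg.norm(X @ W1 @ W2 - X @ Q1 @ Q2) ** 2

print(f"sum of local errors: {local_err:.2f}")
print(f"global output error: {global_err:.2f}")
```

Minimizing `local_err` layer by layer need not minimize `global_err`, since per-layer errors compound through the network; this gap is what the misalignment described above refers to, and it widens as `n_bits` shrinks.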