Quantization-aware training (QAT) is an effective method for drastically reducing the memory footprint of LLMs while keeping performance degradation at an acceptable level. However, choosing the optimal quantization format and bit-width remains challenging in practice. The design space of quantization has not been fully explored in the context of QAT, and the precise trade-off between quantization and downstream performance is poorly understood, as comparisons often rely solely on perplexity-based evaluations. In this work, we address these shortcomings with an empirical study of QAT in the low-bit regime. We show that k-means-based weight quantization outperforms integer formats and can be implemented efficiently on standard hardware. Furthermore, we find that, under a fixed inference memory budget, the best performance on generative downstream tasks is achieved with $1$-bit quantized weights.
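To make the k-means weight-quantization idea concrete, the following is a minimal sketch, not the paper's actual implementation: it assumes a single per-tensor codebook of $2^b$ centroids fit with Lloyd's algorithm (real systems typically use per-group or per-channel codebooks, and QAT would backpropagate through the quantizer). The function name `kmeans_quantize` and all hyperparameters are illustrative.

```python
import numpy as np

def kmeans_quantize(w, bits=1, iters=25, seed=0):
    """Quantize a weight tensor with a k-means codebook of 2**bits centroids.

    Returns (codes, codebook): integer indices into the codebook and the
    float centroids. Dequantization is a table lookup: w_hat = codebook[codes].
    Assumes bits <= 8 so that codes fit in uint8.
    """
    flat = w.reshape(-1).astype(np.float32)
    k = 2 ** bits
    # Initialize centroids at evenly spaced quantiles of the weight
    # distribution so every cluster starts non-empty.
    codebook = np.quantile(flat, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        # Assignment step: nearest centroid per weight (in 1-D, the
        # absolute-distance argmin equals the squared-distance argmin).
        codes = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
        # Update step: move each centroid to the mean of its cluster;
        # empty clusters keep their previous centroid.
        for j in range(k):
            members = flat[codes == j]
            if members.size:
                codebook[j] = members.mean()
    # Final assignment with the converged codebook.
    codes = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
    return codes.astype(np.uint8).reshape(w.shape), codebook.astype(np.float32)

# Example: quantize a random weight matrix to 1 bit per weight.
w = np.random.randn(256, 256).astype(np.float32)
codes, codebook = kmeans_quantize(w, bits=1)
w_hat = codebook[codes]  # dequantization is a single lookup
print(codebook, float(np.mean((w - w_hat) ** 2)))
```

At $b=1$ the codebook holds just two centroids, so each weight is stored as a single bit plus a small shared codebook, and dequantization reduces to a table lookup, which is one plausible reading of why such codebook formats map efficiently onto standard hardware.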