Quantization-aware training (QAT) is essential for extremely low-bit large language models (LLMs). Current QAT methods are mainly based on scalar quantization (SQ), which enables efficient optimization but suffers from severe performance degradation at 2-bit precision. Vector quantization (VQ), by contrast, provides substantially higher representational capacity, but its discrete codebook lookup prevents end-to-end training. We propose LC-QAT, a 2-bit weight-only VQ-QAT framework that represents quantized weights via a learned affine mapping over discrete vectors. This representation yields a high-quality post-training quantization (PTQ) initialization and enables fully differentiable end-to-end optimization without explicit codebook lookup in the training forward pass. The strong PTQ initialization makes LC-QAT highly data-efficient. Experiments across diverse LLMs demonstrate that LC-QAT consistently outperforms state-of-the-art QAT methods while using only 0.1% to 10% of the training data. Our results establish LC-QAT as a practical and scalable solution for extreme low-bit model deployment.
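To make the idea of "a learned affine mapping over discrete vectors" concrete, here is a minimal NumPy sketch. It is not the LC-QAT implementation. The abstract does not specify the form of the mapping, so everything in the sketch is an assumption made for illustration: the form w = A z + b with one shared matrix A and offset b, the choice of codes z in {0, 1, 2, 3}^d (2 bits per entry), the least-squares fit standing in for a PTQ-style initialization, and the function names. The point it illustrates is that once z is held fixed, the reconstructed weights are a differentiable function of A and b, so no codebook lookup is needed in the forward pass.

```python
import numpy as np

# Hypothetical sketch: all names and the exact parameterization are assumptions,
# not taken from the paper.

def reconstruct(z, A, b):
    """Map fixed discrete vectors z of shape (n, d) to weights via w = A z + b."""
    return z @ A.T + b

def fit_affine(z, w):
    """Least-squares fit of (A, b) for fixed z; stands in for a PTQ-style init."""
    z_aug = np.hstack([z, np.ones((z.shape[0], 1))])   # append a bias column
    sol, *_ = np.linalg.lstsq(z_aug, w, rcond=None)    # shape (d + 1, d)
    return sol[:-1].T, sol[-1]

def mse(z, A, b, w):
    """Mean over vectors of the squared reconstruction error."""
    return np.mean(np.sum((reconstruct(z, A, b) - w) ** 2, axis=1))

def grad_step(z, A, b, w, lr=0.01):
    """One gradient-descent step on mse with respect to A and b; z is not updated."""
    err = reconstruct(z, A, b) - w                     # shape (n, d)
    grad_A = 2.0 * err.T @ z / z.shape[0]
    grad_b = 2.0 * err.mean(axis=0)
    return A - lr * grad_A, b - lr * grad_b

rng = np.random.default_rng(0)
d = 4
z = rng.integers(0, 4, size=(256, d)).astype(float)   # 2-bit entries, held fixed
w = rng.normal(size=(256, d))                          # stand-in full-precision weights

A, b = fit_affine(z, w)                                # initialization
A_next, b_next = grad_step(z, A + 0.1, b + 0.1, w)     # refine from a perturbed start
```

In an actual QAT setting the loss would be the model's training objective rather than a weight-reconstruction error, and an autodiff framework would compute the gradients; the manual gradient here only shows that they are well defined for the continuous parameters.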