Layer-wise post-training quantization (PTQ) is a promising technique for compressing large language models (LLMs), owing to its simplicity and effectiveness and the fact that it requires no retraining. However, recent progress in this area is saturating, underscoring the need to revisit its core limitations and explore further improvements. We take on this challenge by identifying a key limitation of existing layer-wise PTQ methods: quantization errors grow as they pass through successive layers, significantly degrading performance, particularly in low-bit regimes. To address this fundamental issue, we propose Quantization Error Propagation (QEP), a general, lightweight, and scalable framework that enhances layer-wise PTQ by explicitly propagating quantization errors and compensating for their accumulation. QEP also offers a tunable propagation mechanism that prevents overfitting and controls computational overhead, enabling the framework to adapt to various architectures and resource budgets. Extensive experiments on several LLMs demonstrate that QEP-enhanced layer-wise PTQ achieves substantially higher accuracy than existing methods, with the gains most pronounced in the extremely low-bit quantization regime.