Elastic precision quantization enables multi-bit deployment via a single optimization pass, fitting diverse quantization scenarios. Yet, owing to the high storage and optimization costs associated with the Transformer architecture, research on elastic quantization remains limited, particularly for large language models. This paper proposes QuEPT, an efficient post-training quantization scheme that reconstructs block-wise multi-bit errors with one-shot calibration on a small data slice. It dynamically adapts to various predefined bit-widths by cascading different low-rank adapters, and supports real-time switching between uniform quantization and mixed-precision quantization without repeated optimization. To enhance accuracy and robustness, we introduce Multi-Bit Token Merging (MB-ToMe), which dynamically fuses token features across different bit-widths, improving robustness during bit-width switching. Additionally, we propose Multi-Bit Cascaded Low-Rank Adapters (MB-CLoRA) to strengthen correlations between bit-width groups, further improving the overall performance of QuEPT. Extensive experiments demonstrate that QuEPT achieves performance comparable to or better than existing state-of-the-art post-training quantization methods. Our code is available at https://github.com/xuke225/QuEPT
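To make the cascading idea concrete, the following is a minimal NumPy sketch of cascaded low-rank adapters for multi-bit switching. It is an illustrative toy under our own assumptions (symmetric uniform quantization, one rank-r adapter pair per bit-width, cumulative summation of adapters up to the active bit-width), not the authors' MB-CLoRA implementation; all names (`CascadedLoRA`, `forward`, `correction`) are hypothetical.

```python
import numpy as np

def quantize(w, bits):
    # Illustrative symmetric uniform quantization (assumption, not QuEPT's scheme).
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

class CascadedLoRA:
    """Hypothetical multi-bit cascaded low-rank adapters: each predefined
    bit-width b owns one rank-r pair (A_b, B_b), and running at bit-width b
    applies the cumulative sum of all adapters with bit-width <= b, so
    adjacent bit-width groups share corrections."""
    def __init__(self, d_in, d_out, bit_widths=(2, 3, 4), rank=4, seed=0):
        rng = np.random.default_rng(seed)
        self.bit_widths = sorted(bit_widths)
        self.adapters = {
            b: (rng.standard_normal((d_in, rank)) * 0.01,
                rng.standard_normal((rank, d_out)) * 0.01)
            for b in self.bit_widths
        }

    def correction(self, bits):
        # Cascade: sum the low-rank products of every bit-width up to `bits`.
        A0, B0 = self.adapters[self.bit_widths[0]]
        delta = np.zeros((A0.shape[0], B0.shape[1]))
        for b in self.bit_widths:
            if b > bits:
                break
            A, B = self.adapters[b]
            delta += A @ B
        return delta

def forward(x, w, lora, bits):
    # Quantized weight plus the cascaded correction for the active bit-width;
    # switching `bits` at inference time needs no re-optimization.
    return x @ (quantize(w, bits) + lora.correction(bits))
```

In this sketch, moving from 2-bit to 4-bit inference only adds the 3-bit and 4-bit adapter products on top of the shared 2-bit one, which is one plausible way cascading could tie bit-width groups together.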