Large transformer models have demonstrated remarkable success. Post-training quantization (PTQ), which requires only a small calibration dataset and avoids end-to-end retraining, is a promising approach to compressing these models. However, existing PTQ methods typically suffer non-trivial performance loss. We find that the bottleneck stems from over-emphasizing hardware compatibility during quantization, which forces these methods to adopt simple quantizers at the expense of accuracy. Based on this insight, we propose RepQuant, a novel PTQ framework built on a quantization-inference decoupling paradigm. RepQuant employs complex quantizers in the quantization process and simplified quantizers in the inference process, and converts between the two with mathematically equivalent transformations via quantization scale reparameterization, thus ensuring both accurate quantization and efficient inference. Specifically, we focus on two components with extreme distributions: LayerNorm activations and Softmax activations. We first apply channel-wise quantization to the former and log$\sqrt{2}$ quantization to the latter, each tailored to the corresponding distribution. For LayerNorm activations, we additionally introduce a learnable per-channel dual clipping scheme, designed to efficiently identify outliers in the unbalanced activations at fine granularity. We then reparameterize the scales to hardware-friendly layer-wise quantization and log2 quantization for inference. Moreover, quantized weight reconstruction is integrated into this procedure to further improve performance. Extensive experiments on large-scale transformer variants across multiple tasks, covering vision, language, and multi-modal transformers, show that RepQuant achieves significant performance advantages.
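The abstract does not spell out the reparameterization formulas, so the following NumPy sketch shows only one way such mathematically equivalent transformations could be constructed. It makes several simplifying assumptions that are not taken from the text: symmetric quantization without zero points, per-channel scales obtained from a plain max-abs calibration (not the learnable dual clipping scheme), the mean per-channel scale as the layer-wise scale, and a LayerNorm that feeds directly into a linear layer. All function and variable names are hypothetical.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Plain LayerNorm over the last (channel) axis."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

def quantize(x, scale, bits=8):
    """Symmetric uniform quantizer (assumption: no zero point)."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax - 1, qmax)

rng = np.random.default_rng(0)
tokens, channels, out_features = 8, 16, 4
x = rng.normal(size=(tokens, channels))
gamma = rng.uniform(0.1, 10.0, size=channels)  # strong inter-channel variation
beta = rng.normal(size=channels)
W = rng.normal(size=(channels, out_features))  # next linear layer
b = rng.normal(size=out_features)

# Quantization process: channel-wise scales s (one per channel).
a = layer_norm(x, gamma, beta)
s = np.abs(a).max(axis=0) / 127.0
q_ref = quantize(a, s)
y_ref = (q_ref * s) @ W + b

# Inference process: a single layer-wise scale s_tilde.
# The ratio r = s / s_tilde is folded into the LayerNorm affine
# parameters and compensated in the next layer's weights.
s_tilde = s.mean()
r = s / s_tilde
gamma_rep, beta_rep = gamma / r, beta / r
W_rep = r[:, None] * W
a_rep = layer_norm(x, gamma_rep, beta_rep)
q_rep = quantize(a_rep, s_tilde)
y_rep = (q_rep * s_tilde) @ W_rep + b

# Softmax side: log-sqrt(2) codes rewritten as a base-2 shift
# plus a parity-dependent constant factor.
def log_sqrt2_quantize(p, bits=4):
    q = np.round(-2.0 * np.log2(p))  # equals -log_{sqrt(2)}(p), rounded
    return np.clip(q, 0, 2 ** bits - 1).astype(np.int64)

def dequant_log_sqrt2(q):
    return 2.0 ** (-q / 2.0)

def dequant_log2_reparam(q):
    shift = (q + 1) // 2  # ceil(q / 2): an integer power-of-two shift
    odd = q % 2           # 1 for odd codes, 0 for even codes
    return 2.0 ** (-shift) * (odd * (np.sqrt(2.0) - 1.0) + 1.0)

logits = rng.normal(size=(tokens, tokens))
p = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
q_soft = log_sqrt2_quantize(p)
```

Under these assumptions, the channel-wise and layer-wise paths produce the same integer codes and the same output of the following linear layer, and the two Softmax dequantizers agree for every code. The integer exponent in the second form is what makes a base-2 bit-shift implementation possible, while the $\sqrt{2}$ correction for odd codes reduces to a constant multiplier.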