Activation outliers in large-scale transformer models pose a fundamental challenge to model quantization: they create excessively large dynamic ranges that cause severe accuracy drops under quantization. We empirically observe that outlier severity intensifies with pre-training scale (e.g., progressing from CLIP to the more extensively trained SigLIP and SigLIP2). Through theoretical analysis and empirical correlation studies, we establish a direct link between these activation outliers and the dominant singular values of the weights. Building on this insight, we propose Selective Spectral Decay ($S^2D$), a geometrically principled conditioning method that surgically regularizes only the weight components corresponding to the largest singular values during fine-tuning. Extensive experiments demonstrate that $S^2D$ significantly reduces activation outliers and produces well-conditioned representations that are inherently quantization-friendly. Models trained with $S^2D$ achieve up to 7% higher post-training quantization (PTQ) accuracy on ImageNet under W4A4 quantization, and gains of 4% when combined with quantization-aware training (QAT). These improvements also generalize across downstream tasks and vision-language models, enabling increasingly large and rigorously trained models to scale without sacrificing deployment efficiency.
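The core operation the abstract describes, shrinking only the dominant singular values of a weight matrix while leaving the rest of the spectrum intact, can be sketched in a few lines. This is a minimal illustrative NumPy sketch under our own assumptions; the function name `s2d_decay_step`, the rank cutoff `k`, and the decay rate `lam` are hypothetical placeholders, not the paper's actual implementation or hyperparameters.

```python
import numpy as np

def s2d_decay_step(W, k=1, lam=1e-3):
    """Hypothetical selective-spectral-decay step: shrink only the top-k
    singular values of W by a factor (1 - lam), leaving the rest of the
    spectrum and the singular vectors untouched."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s = s.copy()
    s[:k] *= (1.0 - lam)   # decay only the dominant spectral components
    return (U * s) @ Vt    # reassemble W with the damped spectrum

# Toy example: a 2x2 weight with one dominant direction (singular values 10 and 1).
W = np.diag([10.0, 1.0])
W_decayed = s2d_decay_step(W, k=1, lam=0.1)
print(np.linalg.svd(W_decayed, compute_uv=False))  # top value shrinks to 9, second stays 1
```

In a training loop, such a step (or an equivalent penalty on the top-k singular values) would be applied to selected weight matrices during fine-tuning, reducing the spectral norm and, per the paper's claim, the resulting activation outliers, without the uniform shrinkage of standard weight decay.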