Vision-Language Models (VLMs) such as CLIP have revolutionized zero-shot classification and safety-critical tasks, including Out-of-Distribution (OOD) detection, but their high computational cost hinders efficient real-world deployment. While quantization is the standard remedy for efficiency, its broader impact on reliability metrics beyond Top-1 accuracy remains critically under-explored. In this study, we conduct a large-scale evaluation of VLM quantization across a comprehensive experimental suite of over 700k evaluation runs with varying configurations. Contrary to the assumption that quantization noise degrades performance, we find that quantization can simultaneously improve accuracy, calibration, OOD detection, and robustness to noise, though not robustness to covariate shift or spurious correlations. We leverage these counterintuitive findings to characterize the mechanics of quantization beyond simple regularization: quantization dampens high-rank spectral components, compelling the model to rely more heavily on robust, low-rank features. This spectral-filtering effect drives the observed gains in generalization and noise tolerance, establishing a pathway to deploying faster, more reliable VLMs that uses quantization beyond its conventional efficiency role.
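The spectral-filtering claim, that quantization perturbs the tail of the weight spectrum proportionally more than the dominant low-rank directions, can be sanity-checked numerically. The following is a minimal illustrative sketch, not code from the study: it assumes symmetric per-tensor int8 round-to-nearest quantization and a synthetic low-rank-plus-noise weight matrix, and compares the relative change of leading versus trailing singular values. The helper `quantize_int8` is a hypothetical name introduced here for illustration.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization (round-to-nearest, then dequantize)."""
    scale = np.abs(w).max() / 127.0
    return np.round(w / scale).clip(-127, 127) * scale

rng = np.random.default_rng(0)
# Synthetic weight matrix: a strong rank-8 component plus small full-rank noise.
u = rng.standard_normal((256, 8))
v = rng.standard_normal((8, 256))
w = u @ v + 0.05 * rng.standard_normal((256, 256))

wq = quantize_int8(w)

# Singular values before and after quantization.
s = np.linalg.svd(w, compute_uv=False)
sq = np.linalg.svd(wq, compute_uv=False)

# Relative change per singular value: the quantization error matrix has roughly
# uniform spectral energy, so it perturbs the small tail singular values
# proportionally far more than the large leading (low-rank) ones.
rel = np.abs(sq - s) / s
print(f"leading-8 mean relative change: {rel[:8].mean():.4f}")
print(f"tail mean relative change:      {rel[8:].mean():.4f}")
```

In this toy setting the leading low-rank directions are nearly preserved while the tail is perturbed proportionally much more, which is the kind of measurement one could use to probe the dampening effect on real VLM weight matrices.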