The powerful zero-shot generalization capabilities of vision-language models (VLMs) like CLIP have enabled new paradigms for safety-related tasks such as out-of-distribution (OOD) detection. However, other aspects crucial to the computationally efficient and reliable deployment of CLIP remain overlooked. In particular, the impact of quantization on CLIP's performance beyond accuracy is underexplored. This work presents a large-scale evaluation of quantization on CLIP models, assessing not only in-distribution accuracy but also a comprehensive suite of reliability metrics, and revealing counterintuitive results driven by the pre-training source. We demonstrate that quantization consistently improves calibration for typically underconfident pre-trained models, while often degrading it for overconfident variants. Intriguingly, this degradation in calibration does not preclude gains in other reliability metrics: we find that OOD detection can still improve for these same poorly calibrated models. Furthermore, we identify specific quantization-aware training (QAT) methods that yield simultaneous gains in zero-shot accuracy, calibration, and OOD robustness, challenging the view of a strict efficiency-performance trade-off. These findings offer critical insights for navigating the multi-objective problem of deploying efficient, reliable, and robust VLMs by extending quantization beyond its conventional role.