Are foundation models for computer vision good conformal predictors?

Recent advances in self-supervision and contrastive learning have brought the performance of foundation models to unprecedented levels in a variety of tasks. Fueled by this progress, these models are becoming the prevailing approach for a wide array of real-world vision problems, including risk-sensitive and high-stakes applications. However, ensuring safe deployment in these scenarios requires a more comprehensive understanding of their uncertainty modeling capabilities, which has received little attention. In this work, we study the behaviour of vision and vision-language foundation models under Conformal Prediction (CP), a statistical framework that provides theoretical guarantees of marginal coverage of the true class. Across extensive experiments including popular vision classification benchmarks, well-known foundation vision models, and three CP methods, our findings reveal that foundation models are well-suited for conformalization procedures, particularly those integrating Vision Transformers. We also show that calibrating the confidence predictions of these models, a popular strategy to improve their uncertainty quantification, actually degrades the efficiency of the conformal sets produced by adaptive CP methods. Furthermore, few-shot adaptation of Vision-Language Models (VLMs) to downstream tasks, whose popularity is surging, enhances conformal scores compared to zero-shot predictions. Finally, our empirical study identifies Adaptive Prediction Sets (APS) as particularly promising in the context of vision foundation models, as it does not violate the marginal coverage guarantees across multiple challenging, yet realistic scenarios.
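
As a rough illustration of the kind of conformalization procedure referred to above, the following sketch runs split conformal prediction with a deterministic (non-randomized) variant of the APS score on synthetic softmax outputs. The synthetic data, the function names, and the miscoverage level `alpha = 0.1` are assumptions made for illustration only; this is not the paper's code or experimental setup.

```python
import numpy as np

def aps_scores(probs, labels):
    """APS score: probability mass accumulated, in descending order,
    up to and including the true class (non-randomized variant)."""
    order = np.argsort(-probs, axis=1)
    cumsum = np.cumsum(np.take_along_axis(probs, order, axis=1), axis=1)
    ranks = np.argmax(order == labels[:, None], axis=1)  # rank of the true class
    return cumsum[np.arange(len(labels)), ranks]

def conformal_threshold(scores, alpha):
    """The ceil((n + 1)(1 - alpha))-th smallest calibration score."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    # If k exceeds n, no finite threshold gives the guarantee.
    return np.sort(scores)[k - 1] if k <= n else np.inf

def aps_sets(probs, qhat):
    """Boolean (n, K) mask: class j is in the set iff its APS score <= qhat."""
    order = np.argsort(-probs, axis=1)
    cumsum = np.cumsum(np.take_along_axis(probs, order, axis=1), axis=1)
    sets = np.zeros_like(cumsum, dtype=bool)
    np.put_along_axis(sets, order, cumsum <= qhat, axis=1)
    return sets

# Synthetic stand-in for a classifier's softmax outputs (assumption).
rng = np.random.default_rng(0)
n, K, alpha = 4000, 10, 0.1
logits = 2.0 * rng.normal(size=(n, K))
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

# Draw labels from the predicted distribution via inverse-CDF sampling.
u = rng.random(n)
labels = np.minimum((np.cumsum(probs, axis=1) < u[:, None]).sum(axis=1), K - 1)

# Split into calibration and test halves (exchangeable by construction).
cal_p, cal_y = probs[: n // 2], labels[: n // 2]
test_p, test_y = probs[n // 2 :], labels[n // 2 :]

qhat = conformal_threshold(aps_scores(cal_p, cal_y), alpha)
sets = aps_sets(test_p, qhat)

covered = sets[np.arange(len(test_y)), test_y]
coverage = covered.mean()          # empirical marginal coverage
avg_set_size = sets.sum(axis=1).mean()  # efficiency: smaller is better
print(f"coverage={coverage:.3f}, average set size={avg_set_size:.2f}")
```

Under exchangeability of calibration and test points, the true class lands in the set with probability at least 1 - alpha on average; the average set size is the efficiency measure that the abstract says degrades when confidence calibration is applied before adaptive CP methods.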