Vision-language models (VLMs), such as CLIP and SigLIP, have achieved remarkable success in classification, retrieval, and generative tasks. To do so, VLMs deterministically map images and text descriptions to a joint latent space in which their similarity is assessed via cosine similarity. However, such deterministic mappings fail to capture uncertainties over concepts that arise from domain shifts in downstream tasks. In this work, we propose post-hoc uncertainty estimation for VLMs that requires no additional training. Our method leverages a Bayesian posterior approximation over the last layers of VLMs and analytically quantifies uncertainties over cosine similarities. We demonstrate its effectiveness for uncertainty quantification and support-set selection in active learning. Compared to baselines, we obtain improved and well-calibrated predictive uncertainties, interpretable uncertainty estimates, and sample-efficient active learning. Our results show promise for safety-critical applications of large-scale models.
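The abstract does not spell out the form of the posterior or the analytic propagation, so the following is only a minimal sketch: it assumes a diagonal Gaussian (e.g., Laplace) posterior over the final image-projection layer and uses Monte Carlo propagation to cosine similarities in place of the paper's closed-form expressions. All dimensions, variable names, and variance values are hypothetical.

```python
import torch

torch.manual_seed(0)

# Toy dimensions: feature dim before the final projection and embedding dim.
d_in, d_emb = 16, 8

# Deterministic "text" embedding (e.g., a class prompt already encoded).
text_emb = torch.nn.functional.normalize(torch.randn(d_emb), dim=0)

# MAP estimate of the image encoder's final projection layer (hypothetical).
W_map = torch.randn(d_emb, d_in) * 0.1

# Assumed diagonal Gaussian posterior over the projection weights, e.g. from
# a post-hoc Laplace approximation (stand-in variance values here).
W_var = torch.full_like(W_map, 1e-2)

# Penultimate features of one image (output of the frozen backbone).
phi = torch.randn(d_in)

# Monte Carlo propagation: sample projection weights, embed the image, and
# measure the spread of the resulting cosine similarities to the text.
n_samples = 1000
W_samples = W_map + W_var.sqrt() * torch.randn(n_samples, d_emb, d_in)
img_embs = torch.einsum("sij,j->si", W_samples, phi)
cos_sims = torch.nn.functional.cosine_similarity(
    img_embs, text_emb.expand(n_samples, -1), dim=1
)

print(f"cosine similarity: mean={cos_sims.mean():.3f}, std={cos_sims.std():.3f}")
```

The reported standard deviation over sampled cosine similarities plays the role of the predictive uncertainty that the paper obtains analytically; high spread would flag inputs affected by domain shift or candidates for active-learning support sets.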