Large-scale vision-language models (VLMs) like CLIP successfully find correspondences between images and text. Through the standard deterministic mapping process, an image or a text sample is mapped to a single vector in the embedding space. This is problematic: because multiple samples (images or texts) can abstract the same concept in the physical world, deterministic embeddings do not reflect the inherent ambiguity in the embedding space. We propose ProbVLM, a probabilistic adapter that estimates probability distributions for the embeddings of pre-trained VLMs via inter- and intra-modal alignment in a post-hoc manner, without needing large-scale datasets or compute. On four challenging datasets (COCO, Flickr, CUB, and Oxford-flowers), we estimate the multi-modal embedding uncertainties for two VLMs (CLIP and BLIP), quantify the calibration of embedding uncertainties in retrieval tasks, and show that ProbVLM outperforms other methods. Furthermore, we propose active learning and model selection as two real-world downstream tasks for VLMs and show that the estimated uncertainty aids both tasks. Lastly, we present a novel technique for visualizing the embedding distributions using a large-scale pre-trained latent diffusion model. Code is available at https://github.com/ExplainableML/ProbVLM.
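To make the idea of a post-hoc probabilistic adapter concrete, the following is a minimal sketch and not the ProbVLM implementation. It assumes a diagonal Gaussian output distribution, a single linear layer per head, random stand-ins for the frozen encoder outputs, and hypothetical names (`init_adapter`, `adapter_forward`, `gaussian_nll`); none of these choices come from the abstract. It shows how a frozen point embedding could be turned into a mean and a variance, and how intra-modal and cross-modal likelihood terms could be evaluated.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # hypothetical embedding size; real VLM embeddings are much larger


def init_adapter(dim, rng):
    """Hypothetical adapter parameters: one linear head for the mean shift, one for log-variance."""
    return {
        "w_mu": rng.normal(scale=0.01, size=(dim, dim)),
        "w_logvar": rng.normal(scale=0.01, size=(dim, dim)),
        "b_logvar": np.zeros(dim),
    }


def adapter_forward(params, z):
    """Map frozen embeddings z of shape (n, dim) to a diagonal Gaussian (mean, log-variance)."""
    mu = z + z @ params["w_mu"]  # residual around the frozen embedding
    logvar = z @ params["w_logvar"] + params["b_logvar"]
    return mu, logvar


def gaussian_nll(target, mu, logvar):
    """Mean negative log-likelihood of target under N(mu, diag(exp(logvar)))."""
    sq = (target - mu) ** 2 / np.exp(logvar)
    return 0.5 * np.mean(sq + logvar + np.log(2.0 * np.pi))


# Assumption: random stand-ins for frozen encoder outputs of 4 paired images and captions.
z_img = rng.normal(size=(4, DIM))
z_txt = z_img + 0.1 * rng.normal(size=(4, DIM))

img_adapter = init_adapter(DIM, rng)
mu_img, logvar_img = adapter_forward(img_adapter, z_img)

# Intra-modal term: the image distribution should explain the image's own embedding.
intra_loss = gaussian_nll(z_img, mu_img, logvar_img)
# Cross-modal term: the image distribution should also explain the paired caption embedding.
cross_loss = gaussian_nll(z_txt, mu_img, logvar_img)
# One scalar uncertainty per image: the mean predicted variance.
uncertainty = np.exp(logvar_img).mean(axis=1)
```

In this sketch only the adapter parameters would be trained, by minimizing the sum of the two terms, while the VLM encoders stay frozen. The per-sample `uncertainty` is the kind of quantity that could then be checked for calibration against retrieval performance.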