Vision-language models (VLMs) embed aligned image-text pairs into a joint space, but they often rely on deterministic embeddings that assume a one-to-one correspondence between images and texts. This oversimplifies real-world relationships, which are inherently many-to-many: multiple captions can describe a single image, and vice versa. We introduce Probabilistic Language-Image Pre-training (ProLIP), the first probabilistic VLM pre-trained on a billion-scale image-text dataset using only probabilistic objectives, achieving strong zero-shot capability (e.g., 74.6% ImageNet zero-shot accuracy with ViT-B/16). ProLIP efficiently estimates uncertainty via an "uncertainty token", without extra parameters. We also introduce a novel inclusion loss that enforces distributional inclusion relationships between image-text pairs and between original and masked inputs. Experiments demonstrate that, by leveraging uncertainty estimates, ProLIP benefits downstream tasks and aligns with intuitive notions of uncertainty, e.g., shorter texts being more uncertain and more general inputs including specific ones. Utilizing text uncertainties, we further improve ImageNet accuracy from 74.6% to 75.8% (in a few-shot setting), supporting the practical advantages of our probabilistic approach. The code is available at https://github.com/naver-ai/prolip
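To illustrate the intuition behind distributional inclusion (a general embedding "containing" a more specific one), here is a minimal toy sketch assuming diagonal-Gaussian embeddings. The containment criterion below (one k-sigma box fitting inside another) is a hypothetical stand-in chosen for clarity, not the paper's actual inclusion loss, and all names here are illustrative.

```python
import numpy as np

def includes(mu_a, var_a, mu_b, var_b, k=2.0):
    """Toy inclusion check for diagonal Gaussians: does the k-sigma box
    of distribution A = (mu_a, var_a) fit entirely inside the k-sigma
    box of distribution B = (mu_b, var_b)?  Illustrative criterion only;
    ProLIP's inclusion loss is defined differently."""
    lo_a, hi_a = mu_a - k * np.sqrt(var_a), mu_a + k * np.sqrt(var_a)
    lo_b, hi_b = mu_b - k * np.sqrt(var_b), mu_b + k * np.sqrt(var_b)
    return bool(np.all(lo_b <= lo_a) and np.all(hi_a <= hi_b))

# A specific caption maps to a narrow (low-uncertainty) distribution;
# a short, generic caption maps to a broad (high-uncertainty) one.
mu_specific = np.array([0.1, -0.2]); var_specific = np.array([0.05, 0.05])
mu_generic  = np.array([0.0,  0.0]); var_generic  = np.array([1.0, 1.0])

print(includes(mu_specific, var_specific, mu_generic, var_generic))  # True
print(includes(mu_generic, var_generic, mu_specific, var_specific))  # False
```

This mirrors the intuition stated above: the broad, general distribution includes the narrow, specific one, but not the reverse.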