Vision-language models (VLMs) embed aligned image-text pairs into a joint space but often rely on deterministic embeddings, assuming a one-to-one correspondence between images and texts. This oversimplifies real-world relationships, which are inherently many-to-many, with multiple captions describing a single image and vice versa. We introduce Probabilistic Language-Image Pre-training (ProLIP), the first probabilistic VLM pre-trained on a billion-scale image-text dataset using only probabilistic objectives, achieving strong zero-shot capability (e.g., 74.6% ImageNet zero-shot accuracy with ViT-B/16). ProLIP efficiently estimates uncertainty via an "uncertainty token" without extra parameters. We also introduce a novel inclusion loss that enforces distributional inclusion relationships between image-text pairs and between original and masked inputs. Experiments demonstrate that, by leveraging uncertainty estimates, ProLIP benefits downstream tasks and aligns with intuitive notions of uncertainty, e.g., shorter texts being more uncertain and more general inputs including specific ones. Utilizing text uncertainties, we further improve ImageNet accuracy from 74.6% to 75.8% (under a few-shot setting), supporting the practical advantages of our probabilistic approach. The code is available at https://github.com/naver-ai/prolip
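As a rough illustration of the distributional-inclusion idea (not ProLIP's exact formulation; the function name and the KL-based score below are assumptions for this sketch), image and text embeddings can be modeled as diagonal Gaussians whose variance encodes uncertainty. A broad, high-variance distribution (a general input) can "include" a narrow, low-variance one (a specific input), which a simple closed-form score can capture:

```python
import numpy as np

def inclusion_score(mu_a, var_a, mu_b, var_b):
    """Toy score for 'distribution A is included in distribution B'.

    Uses the negative KL divergence -KL(A || B) between diagonal
    Gaussians: the score is high when A's probability mass fits
    inside B's support, i.e. when B is broad enough to cover A.
    (Hypothetical helper; ProLIP's actual loss differs.)
    """
    kl = 0.5 * np.sum(
        var_a / var_b
        + (mu_b - mu_a) ** 2 / var_b
        - 1.0
        + np.log(var_b / var_a)
    )
    return -kl

# Same mean; a "specific" embedding (small variance) vs. a "general"
# one (large variance). The narrow distribution fits inside the broad
# one far better than the reverse.
mu = np.zeros(4)
narrow_in_broad = inclusion_score(mu, 0.1 * np.ones(4), mu, 1.0 * np.ones(4))
broad_in_narrow = inclusion_score(mu, 1.0 * np.ones(4), mu, 0.1 * np.ones(4))
assert narrow_in_broad > broad_in_narrow
```

This asymmetry mirrors the intuition in the abstract: shorter, more generic texts should carry higher uncertainty (larger variance) and include the distributions of more specific inputs.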