Vision-language models (VLMs) embed aligned image-text pairs into a joint space but often rely on deterministic embeddings, assuming a one-to-one correspondence between images and texts. This oversimplifies real-world relationships, which are inherently many-to-many, with multiple captions describing a single image and vice versa. We introduce Probabilistic Language-Image Pre-training (ProLIP), the first probabilistic VLM pre-trained on a billion-scale image-text dataset using only probabilistic objectives, achieving strong zero-shot capability (e.g., 74.6% ImageNet zero-shot accuracy with ViT-B/16). ProLIP efficiently estimates uncertainty via an "uncertainty token" without extra parameters. We also introduce a novel inclusion loss that enforces distributional inclusion relationships between image-text pairs and between original and masked inputs. Experiments demonstrate that, by leveraging uncertainty estimates, ProLIP benefits downstream tasks and aligns with intuitive notions of uncertainty, e.g., shorter texts being more uncertain and more general inputs including specific ones. Utilizing text uncertainties, we further improve ImageNet accuracy from 74.6% to 75.8% (under a few-shot setting), supporting the practical advantages of our probabilistic approach. The code is available at https://github.com/naver-ai/prolip
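As a rough illustration of the distributional-inclusion idea (not ProLIP's exact formulation; the function name and the KL-based score below are assumptions for this sketch), image and text embeddings can be modeled as diagonal Gaussians whose variance encodes uncertainty. A broad, high-variance distribution (a general input) can "include" a narrow, low-variance one (a specific input), which a simple closed-form score can capture:

```python
import numpy as np

def inclusion_score(mu_a, var_a, mu_b, var_b):
    """Toy score for 'distribution A is included in distribution B'.

    Uses the negative KL divergence -KL(A || B) between diagonal
    Gaussians: the score is high when A's probability mass fits
    inside B's support, i.e. when B is broad enough to cover A.
    (Hypothetical helper; ProLIP's actual loss differs.)
    """
    kl = 0.5 * np.sum(
        var_a / var_b
        + (mu_b - mu_a) ** 2 / var_b
        - 1.0
        + np.log(var_b / var_a)
    )
    return -kl

# Same mean; a "specific" embedding (small variance) vs. a "general"
# one (large variance). The narrow distribution fits inside the broad
# one far better than the reverse.
mu = np.zeros(4)
narrow_in_broad = inclusion_score(mu, 0.1 * np.ones(4), mu, 1.0 * np.ones(4))
broad_in_narrow = inclusion_score(mu, 1.0 * np.ones(4), mu, 0.1 * np.ones(4))
assert narrow_in_broad > broad_in_narrow
```

This asymmetry mirrors the intuition in the abstract: shorter, more generic texts should carry higher uncertainty (larger variance) and include the distributions of more specific inputs.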