CLIP has shown impressive results in aligning images and texts at scale. However, its ability to capture detailed visual features remains limited because CLIP matches images and texts at a global level. To address this issue, we propose FLAIR, Fine-grained Language-informed Image Representations, an approach that utilizes long and detailed image descriptions to learn localized image embeddings. By sampling diverse sub-captions that describe fine-grained details of an image, we train our vision-language model to produce not only global embeddings but also text-specific image representations. Our model introduces text-conditioned attention pooling on top of local image tokens to produce fine-grained image representations that excel at retrieving detailed image content. We achieve state-of-the-art performance both on existing multimodal retrieval benchmarks and on our newly introduced fine-grained retrieval task, which evaluates vision-language models' ability to retrieve partial image content. Furthermore, our experiments demonstrate the effectiveness of FLAIR, trained on 30M image-text pairs, at capturing fine-grained visual information, including zero-shot semantic segmentation, where it outperforms models trained on billions of pairs. Code is available at https://github.com/ExplainableML/flair .
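The text-conditioned attention pooling described above can be illustrated with a minimal single-head sketch: the global text embedding acts as the query, and the local image tokens serve as both keys and values, so the pooled image representation emphasizes regions relevant to the caption. This is an assumption-laden simplification (the actual model likely uses learned projections and multi-head attention); the function name and shapes are illustrative only.

```python
import numpy as np

def text_conditioned_attention_pool(image_tokens, text_embedding):
    """Pool local image tokens into one text-specific image embedding.

    image_tokens:   (N, d) array of local patch embeddings.
    text_embedding: (d,) global embedding of one (sub-)caption, used as query.
    Returns a (d,) image representation weighted toward text-relevant tokens.
    Hypothetical sketch -- not the paper's exact implementation.
    """
    d = image_tokens.shape[1]
    # Attention logits: scaled similarity of the text query to each token.
    logits = image_tokens @ text_embedding / np.sqrt(d)      # (N,)
    # Numerically stable softmax over the N image tokens.
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Weighted sum of tokens -> text-conditioned image embedding.
    return weights @ image_tokens                            # (d,)
```

Because the output is a convex combination of the image tokens, captions describing different image regions yield different pooled embeddings from the same image, which is what enables retrieving partial image content.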