CLIP has shown impressive results in aligning images and texts at scale. However, its ability to capture detailed visual features remains limited because CLIP matches images and texts at a global level. To address this issue, we propose FLAIR, Fine-grained Language-informed Image Representations, an approach that utilizes long and detailed image descriptions to learn localized image embeddings. By sampling diverse sub-captions that describe fine-grained details of an image, we train our vision-language model to produce not only global embeddings but also text-specific image representations. Our model introduces text-conditioned attention pooling on top of local image tokens to produce fine-grained image representations that excel at retrieving detailed image content. We achieve state-of-the-art performance both on existing multimodal retrieval benchmarks and on our newly introduced fine-grained retrieval task, which evaluates vision-language models' ability to retrieve partial image content. Furthermore, our experiments demonstrate the effectiveness of FLAIR, trained on 30M image-text pairs, at capturing fine-grained visual information, including zero-shot semantic segmentation, where it outperforms models trained on billions of pairs. Code is available at https://github.com/ExplainableML/flair .
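The text-conditioned attention pooling described above can be illustrated with a minimal single-head sketch: the global text embedding acts as the query, and the local image tokens serve as both keys and values, so the pooled image representation emphasizes regions relevant to the caption. This is an assumption-laden simplification (the actual model likely uses learned projections and multi-head attention); the function name and shapes are illustrative only.

```python
import numpy as np

def text_conditioned_attention_pool(image_tokens, text_embedding):
    """Pool local image tokens into one text-specific image embedding.

    image_tokens:   (N, d) array of local patch embeddings.
    text_embedding: (d,) global embedding of one (sub-)caption, used as query.
    Returns a (d,) image representation weighted toward text-relevant tokens.
    Hypothetical sketch -- not the paper's exact implementation.
    """
    d = image_tokens.shape[1]
    # Attention logits: scaled similarity of the text query to each token.
    logits = image_tokens @ text_embedding / np.sqrt(d)      # (N,)
    # Numerically stable softmax over the N image tokens.
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Weighted sum of tokens -> text-conditioned image embedding.
    return weights @ image_tokens                            # (d,)
```

Because the output is a convex combination of the image tokens, captions describing different image regions yield different pooled embeddings from the same image, which is what enables retrieving partial image content.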