ViLLA: Fine-Grained Vision-Language Representation Learning from Real-World Data

Vision-language models (VLMs), such as CLIP and ALIGN, are generally trained on datasets consisting of image-caption pairs obtained from the web. However, real-world multimodal datasets, such as healthcare data, are significantly more complex: each image (e.g. X-ray) is often paired with text (e.g. physician report) that describes many distinct attributes occurring in fine-grained regions of the image. We refer to these samples as exhibiting high pairwise complexity, since each image-text pair can be decomposed into a large number of region-attribute pairings. The extent to which VLMs can capture fine-grained relationships between image regions and textual attributes when trained on such data has not been previously evaluated. The first key contribution of this work is to demonstrate through systematic evaluations that as the pairwise complexity of the training dataset increases, standard VLMs struggle to learn region-attribute relationships, exhibiting performance degradations of up to 37% on retrieval tasks. In order to address this issue, we introduce ViLLA as our second key contribution. ViLLA, which is trained to capture fine-grained region-attribute relationships from complex datasets, involves two components: (a) a lightweight, self-supervised mapping model to decompose image-text samples into region-attribute pairs, and (b) a contrastive VLM to learn representations from generated region-attribute pairs. We demonstrate with experiments across four domains (synthetic, product, medical, and natural images) that ViLLA outperforms comparable VLMs on fine-grained reasoning tasks, such as zero-shot object detection (up to 3.6 AP50 points on COCO and 0.6 mAP points on LVIS) and retrieval (up to 14.2 R-Precision points).
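To make component (b) concrete, the following is a minimal sketch of a symmetric contrastive (InfoNCE-style) objective over region-attribute pairs. It is an illustration under stated assumptions, not the authors' implementation: it assumes the mapping model has already produced matched pairs, so that region i corresponds to attribute i. The function names and the temperature value of 0.07 are hypothetical choices made for this example.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def region_attribute_contrastive_loss(region_embs, attr_embs, temperature=0.07):
    """Symmetric contrastive loss; assumes region i is paired with attribute i."""
    regions = [normalize(r) for r in region_embs]
    attrs = [normalize(a) for a in attr_embs]
    # Cosine similarity between every region and every attribute, scaled.
    logits = [
        [sum(x * y for x, y in zip(r, a)) / temperature for a in attrs]
        for r in regions
    ]
    transposed = [list(col) for col in zip(*logits)]
    region_to_attr = cross_entropy_diagonal(logits)
    attr_to_region = cross_entropy_diagonal(transposed)
    return 0.5 * (region_to_attr + attr_to_region)


# Toy embeddings: three regions and three attributes in a 3-d space.
regions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_attrs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
misaligned_attrs = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]

aligned_loss = region_attribute_contrastive_loss(regions, aligned_attrs)
misaligned_loss = region_attribute_contrastive_loss(regions, misaligned_attrs)
print(aligned_loss < misaligned_loss)
```

The loss is low when each region embedding is closest to its own attribute embedding and high when the pairing is wrong. This is why the quality of the decomposition in component (a) matters for what the contrastive model can learn.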
