Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to produce image/text representations that facilitate a wide range of applications. Recently, CLIP has been widely adopted as the vision backbone of multimodal large language models (MLLMs), connecting image inputs to language interactions. The success of CLIP as a vision-language foundation model relies on aligning web-crawled, noisy text annotations at the image level. However, such image-level alignment can be insufficient for downstream tasks that require fine-grained vision representations, especially when MLLMs demand region-level understanding. In this paper, we improve the localization capability of CLIP with several advances. We propose a pre-training method called Contrastive Localized Language-Image Pre-training (CLOC), which complements CLIP with a region-text contrastive loss and accompanying modules. We formulate a new concept, promptable embeddings, in which the encoder produces image embeddings that are easy to transform into region representations given spatial hints. To support large-scale pre-training, we design a visually-enriched and spatially-localized captioning framework that effectively generates region-text pseudo-labels at scale. Scaling up to billions of annotated images, CLOC yields high-quality regional embeddings for image region recognition and retrieval tasks, and serves as a drop-in replacement for CLIP to enhance MLLMs, especially on referring and grounding tasks.
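To make the two core ideas concrete, the minimal PyTorch sketch below illustrates (a) a promptable embedding, here a hypothetical box-conditioned attention pooler that turns patch embeddings into a region embedding given a spatial hint, and (b) a CLIP-style symmetric InfoNCE loss over matched region-text pairs. All names (BoxPrompter, region_text_contrastive_loss), the attention-pooling design, and the fixed temperature are our illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of region-text contrastive pre-training with a
# promptable embedding; not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoxPrompter(nn.Module):
    """Pool patch embeddings into one region embedding, conditioned on a box prompt."""
    def __init__(self, dim: int):
        super().__init__()
        self.box_embed = nn.Linear(4, dim)  # encode (x1, y1, x2, y2) as an attention query
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, patch_tokens: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, D) image patch embeddings; boxes: (B, 4) normalized coords
        query = self.box_embed(boxes).unsqueeze(1)          # (B, 1, D) spatial hint
        region, _ = self.attn(query, patch_tokens, patch_tokens)
        return F.normalize(region.squeeze(1), dim=-1)       # (B, D) region embedding

def region_text_contrastive_loss(region_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE over matched (region, region-caption) pairs, as in CLIP;
    # a fixed temperature stands in for CLIP's learnable logit scale.
    logits = region_emb @ text_emb.t() / temperature        # (B, B) similarities
    targets = torch.arange(len(logits), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random tensors standing in for encoder outputs.
B, N, D = 4, 196, 512
prompter = BoxPrompter(D)
patches = torch.randn(B, N, D)                   # patch embeddings from an image encoder
boxes = torch.rand(B, 4)                         # pseudo-labeled region boxes
texts = F.normalize(torch.randn(B, D), dim=-1)   # embeddings of region captions
loss = region_text_contrastive_loss(prompter(patches, boxes), texts)
print(loss.item())
```

In this reading, the prompter is cheap relative to the encoder: the image is encoded once, and any number of region embeddings can then be extracted from the same patch tokens by varying the box prompt.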