Multimodal alignment between language and vision is a fundamental topic in current vision-language model research. Contrastive Captioners (CoCa), a representative method, integrates Contrastive Language-Image Pretraining (CLIP) and Image Captioning (IC) into a unified framework and achieves impressive results. CLIP imposes bidirectional constraints on the global representations of entire images and sentences. IC performs unidirectional image-to-text generation on local representations, but it places no constraint on local text-to-image reconstruction, which limits fine-grained image understanding when images are aligned with text. To achieve multimodal alignment from both global and local perspectives, this paper proposes Symmetrizing Contrastive Captioners (SyCoCa), which introduces bidirectional interactions between images and texts at both the global and local representation levels. Specifically, we add a Text-Guided Masked Image Modeling (TG-MIM) head alongside the image-text contrastive (ITC) and IC heads. With this head, SyCoCa can leverage textual cues to reconstruct contextual image content and visual cues to predict textual content. When implementing bidirectional local interactions, local image content is often cluttered or unrelated to the textual description. We therefore employ an attentive masking strategy to select effective image patches for interaction. Extensive experiments on five vision-language tasks, including image-text retrieval, image captioning, visual question answering, and zero-shot/finetuned image classification, validate the effectiveness of the proposed method.
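The attentive masking idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the scoring function, the masking ratio, or whether the selected patches are masked or kept, so everything below is an assumption made for illustration: patches are ranked by cosine similarity to a global text embedding, and the function name `select_text_relevant_patches` and the parameter `ratio` are hypothetical, not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_text_relevant_patches(patch_embeddings, text_embedding, ratio=0.5):
    """Split patch indices into (relevant, remaining) by similarity to the text.

    Assumption: relevance is the cosine similarity between each patch
    embedding and a single global text embedding. The abstract does not
    state how relevance is actually computed.
    """
    scores = [cosine(p, text_embedding) for p in patch_embeddings]
    num_relevant = int(round(len(scores) * ratio))
    # Rank patch indices from most to least similar to the text.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    relevant = sorted(ranked[:num_relevant])
    remaining = sorted(ranked[num_relevant:])
    return relevant, remaining


# Toy example: four 2-D "patch embeddings" and one "text embedding".
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
text = [1.0, 0.0]
relevant, remaining = select_text_relevant_patches(patches, text, ratio=0.5)
print(relevant, remaining)  # [0, 2] [1, 3]
```

Under this reading, a text-guided reconstruction head could mask the text-relevant patches so that the text carries useful information for rebuilding them, while a captioning head could keep them as input. Which set is masked for which head is a design choice that the abstract leaves open.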