Recent advancements in Multimodal Large Language Models (MLLMs) have revolutionized the field of vision-language understanding by integrating visual perception capabilities into Large Language Models (LLMs). The prevailing trend in this field is to use a vision encoder derived from vision-language contrastive learning (CL), which excels at capturing holistic representations but struggles to capture detailed local patterns. In this work, we focus on enhancing the visual representations for MLLMs by combining high-frequency, detailed visual representations obtained through masked image modeling (MIM) with the semantically enriched low-frequency representations captured by CL. To achieve this goal, we introduce X-Former, a lightweight transformer module designed to exploit the complementary strengths of CL and MIM through an innovative interaction mechanism. Specifically, X-Former first bootstraps vision-language representation learning and multimodal-to-multimodal generative learning from two frozen vision encoders, i.e., CLIP-ViT (CL-based) and MAE-ViT (MIM-based). It further bootstraps vision-to-language generative learning from a frozen LLM to ensure that visual features from X-Former can be interpreted by the LLM. To demonstrate the effectiveness of our approach, we assess its performance on tasks demanding detailed visual understanding. Extensive evaluations indicate that X-Former excels in visual reasoning tasks involving both structural and semantic categories in the GQA dataset. Assessment on a fine-grained visual perception benchmark further confirms its superior capabilities in visual understanding.
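The abstract does not spell out the interaction mechanism, but the overall idea of a lightweight query module bridging two frozen encoders and an LLM can be illustrated with a hedged sketch. The following NumPy toy (all dimensions, the two-stage attention order, and the variable names are assumptions, not the paper's actual design) shows learnable queries first cross-attending to CL-style (CLIP-ViT) patch features and then to MIM-style (MAE-ViT) patch features, yielding a fixed-length set of visual tokens that could be projected into an LLM's input space:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, feats, d):
    # Single-head scaled dot-product cross-attention (no learned
    # projections, for brevity): queries attend over patch features.
    scores = queries @ feats.T / np.sqrt(d)   # (num_queries, num_patches)
    weights = softmax(scores, axis=-1)
    return weights @ feats                    # (num_queries, d)

rng = np.random.default_rng(0)
d = 64               # shared feature dimension (assumed)
num_queries = 32     # learnable queries, as in Q-Former-style modules

# Stand-ins for the two frozen encoders' outputs (196 patches + CLS).
clip_feats = rng.standard_normal((197, d))   # CL features (CLIP-ViT-like)
mae_feats = rng.standard_normal((197, d))    # MIM features (MAE-ViT-like)
queries = rng.standard_normal((num_queries, d))

# Stage 1: queries gather low-frequency, semantic CL features.
q = cross_attention(queries, clip_feats, d)
# Stage 2: the same queries then attend to high-frequency MIM features,
# injecting local detail into the semantically grounded representation.
fused = cross_attention(q, mae_feats, d)
print(fused.shape)  # (32, 64): fixed-length visual tokens for the LLM
```

In a real module the queries and attention projections would be trained (the encoders stay frozen), and the fused tokens would pass through a linear projection before being prepended to the LLM's text embeddings.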