Is vision good enough for language? Recent advancements in multimodal models primarily stem from the powerful reasoning abilities of large language models (LLMs). However, the visual component typically depends only on instance-level contrastive language-image pre-training (CLIP). Our research reveals that the visual capabilities of recent multimodal LLMs (MLLMs) still exhibit systematic shortcomings. To understand the roots of these errors, we explore the gap between the visual embedding space of CLIP and that of vision-only self-supervised learning. We identify "CLIP-blind pairs": images that CLIP perceives as similar despite their clear visual differences. With these pairs, we construct the Multimodal Visual Patterns (MMVP) benchmark. MMVP exposes areas where state-of-the-art systems, including GPT-4V, struggle with straightforward questions across nine basic visual patterns, often providing incorrect answers and hallucinated explanations. We further evaluate various CLIP-based vision-and-language models and find a notable correlation between the visual patterns that challenge CLIP models and those that are problematic for multimodal LLMs. As an initial effort to address these issues, we propose a Mixture of Features (MoF) approach, demonstrating that integrating vision self-supervised learning features with MLLMs can significantly enhance their visual grounding capabilities. Together, our findings suggest that visual representation learning remains an open challenge and that accurate visual grounding is crucial for future successful multimodal systems.
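The notion of a CLIP-blind pair can be made concrete with a small sketch. It assumes that each image already has two precomputed embeddings, one from a CLIP image encoder and one from a vision-only self-supervised encoder, and it flags pairs whose CLIP similarity is high while their self-supervised similarity is low. The function names and the thresholds `clip_min` and `ssl_max` are illustrative assumptions, since the abstract specifies neither a selection rule nor cutoffs, and the embeddings are toy values rather than model outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def find_clip_blind_pairs(clip_emb, ssl_emb, clip_min=0.95, ssl_max=0.6):
    """Return index pairs (i, j), i < j, that look alike to CLIP but not to the
    vision-only encoder.

    clip_emb, ssl_emb: lists of embedding vectors, one per image, in the same order.
    clip_min, ssl_max: hypothetical thresholds chosen for illustration only.
    """
    if len(clip_emb) != len(ssl_emb):
        raise ValueError("both embedding lists must cover the same images")
    pairs = []
    for i in range(len(clip_emb)):
        for j in range(i + 1, len(clip_emb)):
            similar_for_clip = cosine(clip_emb[i], clip_emb[j]) > clip_min
            different_for_ssl = cosine(ssl_emb[i], ssl_emb[j]) < ssl_max
            if similar_for_clip and different_for_ssl:
                pairs.append((i, j))
    return pairs


# Toy embeddings for three images (made-up values, not real model outputs).
toy_clip = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
toy_ssl = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

blind_pairs = find_clip_blind_pairs(toy_clip, toy_ssl)
print(blind_pairs)  # [(0, 1)]
```

In this toy data, images 0 and 1 are nearly identical in the CLIP space but orthogonal in the self-supervised space, so they form the only flagged pair. The exhaustive pairwise loop is quadratic in the number of images; a real dataset would likely call for a vectorized or approximate nearest-neighbor search.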