Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated impressive performance on standard visual reasoning benchmarks. However, there is growing concern that these models rely excessively on linguistic shortcuts rather than genuine visual grounding, a phenomenon we term Text Bias. In this paper, we investigate the fundamental tension between visual perception and linguistic priors. We decouple the sources of this bias into two dimensions: Internal Corpus Bias, stemming from statistical correlations learned during pretraining, and External Instruction Bias, arising from the alignment-induced tendency toward sycophancy. To quantify this effect, we introduce V-FAT (Visual Fidelity Against Text-bias), a diagnostic benchmark comprising 4,026 VQA instances across six semantic domains. V-FAT employs a Three-Level Evaluation Framework that systematically escalates the conflict between visual evidence and textual information: (L1) internal bias elicited by atypical images, (L2) external bias induced by misleading instructions, and (L3) synergistic bias in which both sources coincide. We further introduce the Visual Robustness Score (VRS), a metric designed to penalize "lucky" linguistic guesses and reward true visual fidelity. Our evaluation of 12 frontier MLLMs reveals that, although these models excel on existing benchmarks, they suffer significant visual collapse under high linguistic dominance.
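As a rough illustration of how a metric that penalizes "lucky" linguistic guesses could be operationalized, the Python sketch below scores each instance by whether the prediction matches the image-grounded answer or instead echoes the textual prior, then averages over a batch. The data layout (Instance), the penalty weight, and the rescaling to [0, 1] are illustrative assumptions only; the paper's actual VRS definition is not given in this section.

```python
# Illustrative sketch (not the paper's VRS definition): reward answers grounded
# in the image, penalize answers that merely echo the textual prior.
from dataclasses import dataclass

@dataclass
class Instance:
    level: str            # "L1", "L2", or "L3" conflict level
    prediction: str       # model's answer
    visual_answer: str    # ground truth supported by the image
    prior_answer: str     # answer suggested by the linguistic prior / instruction

def instance_score(inst: Instance, penalty: float = 1.0) -> float:
    """+1 for matching the visual ground truth, -penalty for echoing the prior."""
    if inst.prediction == inst.visual_answer:
        return 1.0
    if inst.prediction == inst.prior_answer:
        return -penalty
    return 0.0  # an unrelated wrong answer is neither rewarded nor penalized

def visual_robustness_score(instances: list[Instance]) -> float:
    """Average per-instance scores (in [-1, 1]) and rescale to [0, 1]."""
    if not instances:
        return 0.0
    raw = sum(instance_score(i) for i in instances) / len(instances)
    return (raw + 1.0) / 2.0

# Toy usage: one visually faithful answer, one prior-following answer, one unrelated error.
batch = [
    Instance("L1", "zebra", visual_answer="zebra", prior_answer="horse"),
    Instance("L2", "horse", visual_answer="zebra", prior_answer="horse"),
    Instance("L3", "dog",   visual_answer="zebra", prior_answer="horse"),
]
print(f"VRS (toy) = {visual_robustness_score(batch):.3f}")
```

Under this toy scheme, prior-following errors lower the score below what plain accuracy would report, which is the behavior the abstract attributes to VRS.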