Counterfactual reasoning, a crucial manifestation of human intelligence, refers to making presuppositions based on established facts and extrapolating potential outcomes. Existing multimodal large language models (MLLMs) have exhibited impressive cognitive and reasoning capabilities, which have been examined across a wide range of Visual Question Answering (VQA) benchmarks. How, then, do existing MLLMs perform when faced with counterfactual questions? To answer this question, we first curate a novel \textbf{C}ounter\textbf{F}actual \textbf{M}ulti\textbf{M}odal reasoning benchmark, abbreviated as \textbf{CFMM}, to systematically assess the counterfactual reasoning capabilities of MLLMs. CFMM comprises six challenging tasks, each including hundreds of carefully human-labeled counterfactual questions, to evaluate MLLMs' counterfactual reasoning capabilities across diverse aspects. Our experiments show that existing MLLMs tend to believe what they see while ignoring the counterfactual presuppositions presented in the question, which leads to inaccurate responses. We further evaluate a wide range of prevalent MLLMs on CFMM. The significant gap between their performance on CFMM and on several VQA benchmarks indicates that existing MLLMs still have considerable room for improvement toward human-level intelligence. Conversely, improving MLLMs' performance on CFMM offers a potential avenue toward developing MLLMs with more advanced intelligence.