Counterfactual reasoning, a crucial manifestation of human intelligence, refers to making presuppositions based on established facts and extrapolating potential outcomes. Existing multimodal large language models (MLLMs) have exhibited impressive cognitive and reasoning capabilities, which have been examined across a wide range of Visual Question Answering (VQA) benchmarks. Nevertheless, how will existing MLLMs perform when faced with counterfactual questions? To answer this question, we first curate a novel \textbf{C}ounter\textbf{F}actual \textbf{M}ulti\textbf{M}odal reasoning benchmark, abbreviated as \textbf{CFMM}, to systematically assess the counterfactual reasoning capabilities of MLLMs. Our CFMM comprises six challenging tasks, each containing hundreds of carefully human-labeled and GPT-generated counterfactual questions, to evaluate MLLMs' counterfactual reasoning capabilities across diverse aspects. Interestingly, through experiments we find that existing MLLMs prefer to believe what they see while ignoring the counterfactual presuppositions presented in the question, thereby leading to inaccurate responses. Furthermore, we evaluate a wide range of prevalent MLLMs on our proposed CFMM. The significant gap between their performance on our CFMM and that on several VQA benchmarks indicates that there is still considerable room for improvement before existing MLLMs approach human-level intelligence. Conversely, boosting MLLMs' performance on our CFMM in the future may open potential avenues toward developing MLLMs with advanced intelligence.