Recent advancements in Large Language Models (LLMs) have facilitated the development of Multimodal LLMs (MLLMs). Despite their impressive capabilities, MLLMs often over-rely on unimodal biases (e.g., language bias and vision bias), leading to incorrect answers or hallucinations in complex multimodal tasks. To investigate this issue, we propose a causal framework to interpret the biases in Visual Question Answering (VQA) problems. Within this framework, we conduct an in-depth causal analysis to assess the causal effect of these biases on MLLM predictions. Based on the analysis, we introduce 1) a novel MORE dataset with 12,000 challenging VQA instances that require multi-hop reasoning and resistance to unimodal biases, and 2) a causality-enhanced agent framework, CAVE, that guides models to comprehensively integrate information from different modalities and mitigate biases. Our experiments show that MLLMs perform poorly on MORE, indicating strong unimodal biases and limited semantic understanding. However, integrating them with CAVE yields promising improvements in reasoning and bias mitigation. These findings provide important insights for the development of more robust MLLMs and contribute to the broader goal of advancing multimodal AI systems capable of deeper understanding and reasoning. Our project page is at https://github.com/OpenCausaLab/MORE.
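For intuition only: one crude way to probe the kind of unimodal reliance described above is to compare a model's answer distribution with and without one modality; if removing the image barely changes the answers, the prediction is being driven by the language prior. The sketch below is a minimal illustration of this ablation-style probe, not the paper's causal framework; the `answer_probs` inference wrapper is a hypothetical stand-in for any MLLM call that returns answer probabilities.

```python
# Minimal, illustrative sketch (assumption, NOT the MORE/CAVE method):
# probe language bias by comparing answer distributions with and without
# the image. `answer_probs` is a hypothetical MLLM inference wrapper.

from typing import Callable, Dict, Optional

def language_bias_score(
    answer_probs: Callable[[str, Optional[bytes]], Dict[str, float]],
    question: str,
    image: bytes,
) -> float:
    """Crude proxy for language-prior reliance: 1 minus the total
    variation distance between the answer distributions with the full
    input and with the image ablated. Values near 1.0 suggest the
    question alone determines the answer (strong language bias)."""
    with_image = answer_probs(question, image)
    without_image = answer_probs(question, None)  # counterfactual: no vision
    answers = set(with_image) | set(without_image)
    tv = 0.5 * sum(
        abs(with_image.get(a, 0.0) - without_image.get(a, 0.0))
        for a in answers
    )
    return 1.0 - tv
```

In use, `answer_probs` would wrap whatever MLLM is being evaluated; passing `None` for the image is the ablation. A symmetric probe (ablating the question instead) would target vision bias.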