Recent advances in Large Language Models (LLMs) have facilitated the development of Multimodal LLMs (MLLMs). Despite their impressive capabilities, MLLMs often over-rely on unimodal biases (e.g., language bias and vision bias), which leads to incorrect answers in complex multimodal tasks. To investigate this issue, we propose a causal framework for interpreting the biases in Visual Question Answering (VQA) problems. Within this framework, we devise a causal graph to elucidate the predictions of MLLMs on VQA problems and assess the causal effect of the biases through an in-depth causal analysis. Motivated by the causal graph, we introduce a novel dataset, MORE, consisting of 12,000 VQA instances. The dataset is designed to challenge MLLMs by requiring multi-hop reasoning and the ability to overcome unimodal biases. Furthermore, we propose two strategies to mitigate unimodal biases and enhance MLLMs' reasoning capabilities: a Decompose-Verify-Answer (DeVA) framework for limited-access MLLMs, and fine-tuning for open-source MLLMs. Extensive quantitative and qualitative experiments offer valuable insights for future research.