What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models

Counterfactual reasoning is one of the core abilities of human intelligence. It involves considering alternatives to observed states or past events, and it can improve our planning and decision-making. In this work, we benchmark the counterfactual reasoning ability of multi-modal large language models. We take question-answer pairs from the VQAv2 dataset and add one counterfactual presupposition to each question, modifying the answer accordingly. After generating counterfactual questions and answers with ChatGPT, we manually examine all of them to ensure correctness. Over 2k counterfactual question-answer pairs are collected this way. We evaluate recent vision language models on the newly collected test dataset and find that all models exhibit a large performance drop compared with their results on questions without the counterfactual presupposition. This result indicates that there is still room for improving vision language models. Apart from vision language models, our proposed dataset can also serve as a benchmark for evaluating code generation LLMs, where results demonstrate a large gap between GPT-4 and current open-source models. Our code and dataset are available at \url{https://github.com/Letian2003/C-VQA}.
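The comparison described in the abstract (accuracy on original questions versus accuracy on their counterfactual versions) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the `Item` fields, the `model(image_id, question)` callable, and the exact-match scoring are hypothetical and are not taken from the C-VQA repository, whose actual data format and metric may differ.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Item:
    """One benchmark entry (hypothetical layout, not the repository's schema)."""
    image_id: str
    question: str      # original VQAv2-style question
    answer: str        # answer to the original question
    cf_question: str   # question with a counterfactual presupposition added
    cf_answer: str     # answer modified to match the presupposition


def accuracy(preds: List[str], golds: List[str]) -> float:
    """Case-insensitive exact-match accuracy (an assumed, simplified metric)."""
    if not golds:
        raise ValueError("empty evaluation set")
    hits = sum(p.strip().lower() == g.strip().lower() for p, g in zip(preds, golds))
    return hits / len(golds)


def performance_drop(
    items: List[Item], model: Callable[[str, str], str]
) -> Tuple[float, float, float]:
    """Return (original accuracy, counterfactual accuracy, drop)."""
    orig_preds = [model(it.image_id, it.question) for it in items]
    cf_preds = [model(it.image_id, it.cf_question) for it in items]
    orig_acc = accuracy(orig_preds, [it.answer for it in items])
    cf_acc = accuracy(cf_preds, [it.cf_answer for it in items])
    return orig_acc, cf_acc, orig_acc - cf_acc


# Made-up example entries, for illustration only.
items = [
    Item("img1", "Is the TV on?", "yes",
         "Would the TV be on if someone turned it off?", "no"),
    Item("img2", "How many cats are there?", "2",
         "How many cats would there be if one more cat came in?", "3"),
]

# A toy "model" that ignores the presupposition and answers from the image alone.
image_only = {"img1": "yes", "img2": "2"}


def toy_model(image_id: str, question: str) -> str:
    return image_only[image_id]


orig_acc, cf_acc, drop = performance_drop(items, toy_model)
```

A model that answers from the image alone, as the toy model does, scores well on the original questions and poorly on the counterfactual ones, which is the kind of gap the benchmark is designed to expose.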

