Counterfactual reasoning is one of the core abilities of human intelligence. It involves considering alternatives to observed states or past events, and it can improve planning and decision-making. In this work, we focus on benchmarking the counterfactual reasoning ability of multi-modal large language models. We take question-answer pairs from the VQAv2 dataset and add one counterfactual presupposition to each question, modifying the answer accordingly. After generating counterfactual questions and answers using ChatGPT, we manually examine all of them to ensure correctness. Over 2k counterfactual question-answer pairs are collected this way. We evaluate recent vision language models on our newly collected test dataset and find that all models exhibit a large performance drop compared to their results on questions without the counterfactual presupposition. This result indicates that vision language models still have substantial room for improvement. Beyond vision language models, our proposed dataset can also serve as a benchmark for evaluating code generation LLMs; the results demonstrate a large gap between GPT-4 and current open-source models. Our code and dataset are available at \url{https://github.com/Letian2003/C-VQA}.