Multimodal counterfactual reasoning is a vital yet challenging ability for AI systems. It involves predicting the outcomes of hypothetical circumstances based on vision and language inputs, which enables AI models to learn from failures and explore hypothetical scenarios. Despite its importance, only a few datasets target the counterfactual reasoning abilities of multimodal models, and those that exist cover only synthetic environments or specific types of events (e.g., traffic collisions), making it hard to reliably benchmark model generalization across diverse real-world scenarios and reasoning dimensions. To overcome these limitations, we develop a video question answering dataset, ACQUIRED. It consists of 3.9K annotated videos that encompass a wide range of event types and incorporate both first-person and third-person viewpoints, ensuring a focus on real-world diversity. In addition, each video is annotated with questions spanning three distinct dimensions of reasoning (physical, social, and temporal), which allows a comprehensive evaluation of models' counterfactual abilities along multiple aspects. We benchmark several state-of-the-art language-only and multimodal models on our dataset, and experimental results demonstrate a significant performance gap (>13%) between models and humans. The findings suggest that multimodal counterfactual reasoning remains an open challenge and that ACQUIRED is a comprehensive and reliable benchmark for inspiring future research in this direction.