Causal reasoning is fundamental to human intelligence and crucial for effective decision-making in real-world environments. Despite recent advances in large vision-language models (LVLMs), their ability to comprehend causality remains unclear. Previous work typically focuses on commonsense causality between events and/or actions, which is insufficient for applications such as embodied agents and lacks the explicitly defined causal graphs required for formal causal reasoning. To overcome these limitations, we introduce a fine-grained and unified definition of causality involving interactions between humans and/or objects. Building on this definition, we construct a novel dataset, CELLO, consisting of 14,094 causal questions across all four levels of causality: discovery, association, intervention, and counterfactual. The dataset goes beyond traditional commonsense causality by including explicit causal graphs that detail the interactions between humans and objects. Extensive experiments on CELLO reveal that current LVLMs still struggle with causal reasoning tasks, but that they benefit significantly from our proposed CELLO-CoT, a causally inspired chain-of-thought prompting strategy. Both quantitative and qualitative analyses from this study provide valuable insights for future research. Our project page is at https://github.com/OpenCausaLab/CELLO.