Reasoning in the real world is not divorced from situations. How to capture present knowledge from surrounding situations and reason accordingly is crucial and challenging for machine intelligence. This paper introduces a new benchmark that evaluates situated reasoning ability via situation abstraction and logic-grounded question answering for real-world videos, called Situated Reasoning in Real-World Videos (STAR Benchmark). This benchmark is built upon real-world videos associated with human actions or interactions, which are naturally dynamic, compositional, and logical. The dataset includes four types of questions: interaction, sequence, prediction, and feasibility. We represent the situations in real-world videos by hypergraphs connecting extracted atomic entities and relations (e.g., actions, persons, objects, and relationships). Besides visual perception, situated reasoning also requires structured situation comprehension and logical reasoning. Questions and answers are procedurally generated. The answering logic of each question is represented by a functional program based on a situation hypergraph. We compare various existing video reasoning models and find that they all struggle on this challenging situated reasoning task. We further propose a diagnostic neuro-symbolic model that can disentangle visual perception, situation abstraction, language understanding, and functional reasoning to understand the challenges of this benchmark.
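To make the situation-hypergraph and functional-program ideas concrete, here is a minimal, hypothetical sketch: a toy situation encoded as time-ordered action hyperedges over persons, objects, and relationships, plus a tiny two-step functional program answering an interaction-style question. All entity names, the tuple schema, and the program steps (`filter_actions`, `query_object`) are illustrative assumptions, not the benchmark's actual representation.

```python
# Toy situation hypergraph: atomic entities plus action hyperedges, each
# connecting (action, person, object, relationship) in temporal order.
# Schema and names are illustrative assumptions, not STAR's actual format.
situation = {
    "entities": {"persons": ["p1"], "objects": ["cup", "table"]},
    "hyperedges": [
        ("pick_up", "p1", "cup", "holding"),
        ("put_down", "p1", "cup", "on_table"),
    ],
}

def filter_actions(graph, person):
    """Keep hyperedges involving the given person, preserving temporal order."""
    return [e for e in graph["hyperedges"] if e[1] == person]

def query_object(edges, position):
    """Return the object argument of the edge at the given temporal position."""
    return edges[position][2]

# Functional program for a question like
# "Which object did the person interact with first?"
program = [
    (filter_actions, {"person": "p1"}),
    (query_object, {"position": 0}),
]

def execute(graph, program):
    """Run the program steps left to right, threading the intermediate result."""
    result = graph
    for fn, kwargs in program:
        result = fn(result, **kwargs)
    return result

print(execute(situation, program))  # -> cup
```

The key design point the sketch illustrates is that each reasoning step is an explicit, inspectable function over the abstracted situation rather than an end-to-end prediction, which is what allows a neuro-symbolic model to disentangle perception from reasoning.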