Despite rapid progress, embodied agents still struggle with long-horizon manipulation that requires maintaining spatial consistency, causal dependencies, and goal constraints. A key limitation of existing approaches is that task reasoning is implicitly embedded in high-dimensional latent representations, making it challenging to separate task structure from perceptual variability. We introduce Grounded Scene-graph Reasoning (GSR), a structured reasoning paradigm that explicitly models world-state evolution as transitions over semantically grounded scene graphs. By reasoning step-wise over object states and spatial relations, rather than directly mapping perception to actions, GSR enables explicit reasoning about action preconditions, consequences, and goal satisfaction in a physically grounded space. To support learning such reasoning, we construct Manip-Cognition-1.6M, a large-scale dataset that jointly supervises world understanding, action planning, and goal interpretation. Extensive evaluations across RLBench, LIBERO, GSR-benchmark, and real-world robotic tasks show that GSR significantly improves zero-shot generalization and long-horizon task completion over prompting-based baselines. These results highlight explicit world-state representations as a key inductive bias for scalable embodied reasoning.
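The abstract's core idea, reasoning over world state as transitions between scene graphs with explicit action preconditions, effects, and goal checks, can be illustrated with a minimal sketch. Note this is purely an assumed toy formalization: the triple-based `SceneGraph`, the `Action` class, and the pick-and-place example below are not the paper's actual GSR implementation, which is not specified here.

```python
from dataclasses import dataclass

# Illustrative assumption: a scene graph as a set of
# (subject, relation, object) triples, not the authors' representation.
SceneGraph = frozenset


@dataclass(frozen=True)
class Action:
    """A hypothetical action with STRIPS-style preconditions and effects."""
    name: str
    preconditions: frozenset  # triples that must hold before execution
    add: frozenset            # triples made true by the action
    delete: frozenset         # triples made false by the action

    def applicable(self, state: SceneGraph) -> bool:
        # Explicit precondition check against the current world state.
        return self.preconditions <= state

    def apply(self, state: SceneGraph) -> SceneGraph:
        # Step-wise transition: remove deleted relations, add new ones.
        if not self.applicable(state):
            raise ValueError(f"preconditions of {self.name} not satisfied")
        return SceneGraph((state - self.delete) | self.add)


def goal_satisfied(state: SceneGraph, goal: frozenset) -> bool:
    # Goal interpretation as subgraph containment.
    return goal <= state


# Toy example: pick a cup off the table, then place it on a shelf.
state = SceneGraph({("cup", "on", "table"), ("gripper", "is", "empty")})
pick = Action(
    name="pick(cup)",
    preconditions=frozenset({("cup", "on", "table"), ("gripper", "is", "empty")}),
    add=frozenset({("gripper", "holding", "cup")}),
    delete=frozenset({("cup", "on", "table"), ("gripper", "is", "empty")}),
)
place = Action(
    name="place(cup, shelf)",
    preconditions=frozenset({("gripper", "holding", "cup")}),
    add=frozenset({("cup", "on", "shelf"), ("gripper", "is", "empty")}),
    delete=frozenset({("gripper", "holding", "cup")}),
)

state = place.apply(pick.apply(state))
assert goal_satisfied(state, frozenset({("cup", "on", "shelf")}))
```

Because each transition is over explicit symbolic state rather than a latent representation, precondition violations and goal completion are directly checkable at every step, which is the inductive bias the abstract argues for.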