Semantic perception has enabled 3D scene understanding and affordance-driven object interaction. However, robots operating in real-world environments face a critical limitation: they cannot anticipate how objects move. Long-horizon mobile manipulation requires closing the gap between semantics, geometry, and kinematics. In this work, we present MoMa-SG, a novel framework for building semantic-kinematic 3D scene graphs of articulated scenes containing a myriad of interactable objects. Given RGB-D sequences containing multiple object articulations, we temporally segment object interactions and infer object motion using occlusion-robust point tracking. We then lift point trajectories into 3D and estimate articulation models using a novel unified twist estimation formulation that robustly estimates revolute and prismatic joint parameters in a single optimization pass. Next, we associate objects with estimated articulations and detect contained objects by reasoning over parent-child relations at identified opening states. We also introduce the novel Arti4D-Semantic dataset, which uniquely combines hierarchical object semantics, including parent-child relation labels, with object axis annotations across 62 in-the-wild RGB-D sequences comprising 600 object interactions captured under three distinct observation paradigms. We extensively evaluate the performance of MoMa-SG on two datasets and ablate key design choices of our approach. In addition, real-world experiments on both a quadruped and a mobile manipulator demonstrate that our semantic-kinematic scene graphs enable robust manipulation of articulated objects in everyday home environments. We provide code and data at: https://momasg.cs.uni-freiburg.de.
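To illustrate the joint-estimation step, the following is a minimal sketch, not the paper's unified single-pass optimization: given 3D point trajectories on an articulated part at two time steps, it fits a rigid transform (Kabsch algorithm) and extracts a twist via the SE(3) logarithm, classifying the joint as prismatic (near-zero rotation) or revolute (recovering axis direction, a point on the axis, and the opening angle). All function and dictionary-key names here are illustrative, not from the released code.

```python
import numpy as np
from scipy.spatial.transform import Rotation


def estimate_rigid_transform(P, Q):
    """Kabsch: find R, t minimizing ||R @ p + t - q|| over corresponding rows of P, Q."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(H)
    # Reflection guard so R is a proper rotation (det = +1).
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    return R, cq - R @ cp


def twist_from_transform(R, t, eps=1e-6):
    """Classify a joint from one relative rigid motion (simplified two-frame variant)."""
    rotvec = Rotation.from_matrix(R).as_rotvec()
    theta = np.linalg.norm(rotvec)
    if theta < eps:
        # Negligible rotation: treat as a prismatic joint sliding along t.
        d = np.linalg.norm(t)
        return {"type": "prismatic", "axis": t / d, "displacement": d}
    w = rotvec / theta                      # unit rotation axis
    pitch = float(w @ t) / theta            # translation along the axis per radian
    # Point on the axis: minimum-norm least-squares solution of (I - R) q = t - pitch*theta*w.
    q = np.linalg.lstsq(np.eye(3) - R, t - pitch * theta * w, rcond=None)[0]
    return {"type": "revolute", "axis": w, "point": q, "angle": theta}
```

In practice, tracked points are noisy and span many frames, which is why the paper estimates all frames jointly in one optimization pass; this two-frame closed form only conveys the underlying twist parameterization.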