Representing and understanding 3D environments in a structured manner is crucial for autonomous agents to navigate and reason about their surroundings. While traditional Simultaneous Localization and Mapping (SLAM) methods generate metric reconstructions and can be extended to metric-semantic mapping, they lack higher-level abstraction and relational reasoning. To address this gap, 3D scene graphs have emerged as a powerful representation for capturing hierarchical structures and object relationships. In this work, we propose an enhanced hierarchical 3D scene graph that integrates open-vocabulary features across multiple abstraction levels and supports object-relational reasoning. Our approach leverages a Vision Language Model (VLM) to infer semantic relationships between objects. Notably, we introduce a task reasoning module that combines a Large Language Model (LLM) and a VLM to interpret the scene graph's semantic and relational information, enabling agents to reason about tasks and interact with their environment more intelligently. We validate our method by deploying it on a quadruped robot across multiple environments and tasks, demonstrating its task reasoning ability.