Large Language Models (LLMs) can help robots reason about abstract task specifications. This requires augmenting the classical environment representations used by robots, such as point clouds and meshes, with natural-language priors. The existing literature offers a number of approaches. Some navigation frameworks leverage scene-level semantics at the expense of object-level detail, while others, such as language-guided neural radiance fields (NeRFs) or Segment Anything 3D (SAM3D), prioritize object accuracy over global scene context. This paper argues that we can get the best of both worlds. We use a Unitree Go2 quadruped equipped with a RealSense stereo camera (providing RGB-D data) to build an explicit metric-semantic representation of indoor environments: a scene-scale representation in which each object (e.g., chairs, couches, and doors of various shapes and sizes) is represented by a detailed mesh, a category label, and a pose. We show that this representation is more accurate than foundation-model-based maps such as those built by SAM3D, as well as state-of-the-art scene-level robotics mapping pipelines such as Clio (Maggio et al., 2024). Our implementation is about 25$\times$ faster than SAM3D and about 10$\times$ slower than Clio. Our approach also extends to open-set scene-level mapping, i.e., settings where object meshes are not known a priori, by building on SAM3D to further improve its precision and recall. We show how this representation can be readily used with LLMs such as Google's Gemini for scene understanding, complex inference, and planning. We also demonstrate its utility for semantic navigation in simulated warehouse and hospital settings using Nvidia's Isaac Sim.
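To make the representation concrete, the sketch below illustrates one way to encode the per-object mesh/category/pose records described above and serialize them as text context for an LLM. This is a minimal illustration under our own assumptions: the class names, the `trimesh` dependency, and the text serialization are hypothetical, not the paper's implementation.

```python
# Hypothetical sketch of the metric-semantic map described in the abstract:
# each object carries a detailed mesh, a category label, and a 6-DoF pose.
from dataclasses import dataclass, field
import numpy as np
import trimesh


@dataclass
class SemanticObject:
    category: str          # e.g., "chair", "couch", "door"
    mesh: trimesh.Trimesh  # detailed object mesh in the object's local frame
    pose: np.ndarray       # 4x4 homogeneous transform, object frame -> world frame

    def mesh_in_world(self) -> trimesh.Trimesh:
        """Return a copy of the mesh transformed into the world frame."""
        world_mesh = self.mesh.copy()
        world_mesh.apply_transform(self.pose)
        return world_mesh


@dataclass
class MetricSemanticMap:
    objects: list[SemanticObject] = field(default_factory=list)

    def query_category(self, category: str) -> list[SemanticObject]:
        """All object instances of a given category, e.g., for grounding a query."""
        return [o for o in self.objects if o.category == category]

    def summary_for_llm(self) -> str:
        """Compact text summary of the scene, usable as context for an LLM prompt.

        (Assumed format; the paper does not specify how the map is serialized.)
        """
        lines = []
        for obj in self.objects:
            x, y, z = obj.pose[:3, 3]
            lines.append(f"{obj.category} at ({x:.2f}, {y:.2f}, {z:.2f}) m")
        return "\n".join(lines)
```

A text summary like this keeps the interface to the LLM simple: the planner can prompt, say, Gemini with `map.summary_for_llm()` plus a task description, while the full meshes and poses remain available for downstream geometric planning.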