We present an Open-Vocabulary 3D Scene Graph (OVSG), a formal framework for grounding a variety of entities, such as object instances, agents, and regions, with free-form text-based queries. Unlike conventional semantic-based object localization approaches, our system supports context-aware entity localization, allowing for queries such as ``pick up a cup on a kitchen table'' or ``navigate to a sofa on which someone is sitting''. In contrast to existing research on 3D scene graphs, OVSG supports free-form text input and open-vocabulary querying. Through a series of comparative experiments on the ScanNet dataset and a self-collected dataset, we demonstrate that our approach significantly outperforms previous semantic-based localization techniques. Moreover, we highlight the practical application of OVSG in real-world robot navigation and manipulation experiments.