Environment maps with rich semantics are pivotal for seamless interaction between robots and humans, enabling robots to carry out a variety of tasks effectively. Open-vocabulary maps, powered by Visual-Language Models (VLMs), offer inherent advantages, including multimodal retrieval and open-set classes. However, existing open-vocabulary maps are restricted to closed indoor scenarios and to VLM features, which limits their usability and inference capabilities. Moreover, the absence of topological relationships makes it harder to query specific instances accurately. In this work, we propose OpenGraph, an open-vocabulary hierarchical graph representation designed for large-scale outdoor environments. OpenGraph first extracts instances and their captions from visual images using 2D foundation models, and encodes the captions into features to enhance textual reasoning. Subsequently, 3D incremental panoramic mapping with feature embedding is achieved by projecting images onto LiDAR point clouds. Finally, the environment is segmented based on lane graph connectivity to construct a hierarchical graph. Validation results on the public SemanticKITTI dataset demonstrate that, even without fine-tuning the models, OpenGraph generalizes to novel semantic classes and achieves the highest segmentation and query accuracy. The source code of OpenGraph is publicly available at https://github.com/BIT-DYN/OpenGraph.
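The mapping step described above relies on associating image pixels with LiDAR points. The following is a minimal sketch of that association, assuming an undistorted pinhole camera with intrinsics `K` and a known 4x4 LiDAR-to-camera transform `T_cam_lidar`. The function names `project_lidar_to_image` and `label_points` are hypothetical and are not taken from the OpenGraph code base. Occlusion handling and feature fusion across frames are omitted.

```python
import numpy as np


def project_lidar_to_image(points, T_cam_lidar, K, image_shape):
    """Project N x 3 LiDAR points into the image plane.

    Returns integer pixel coordinates (col, row) and a boolean mask marking
    points that lie in front of the camera and inside the image bounds.
    All names here are illustrative assumptions.
    """
    n = points.shape[0]
    homo = np.hstack([points, np.ones((n, 1))])   # N x 4 homogeneous points
    cam = (T_cam_lidar @ homo.T).T[:, :3]         # N x 3 in the camera frame
    in_front = cam[:, 2] > 1e-6
    # Replace non-positive depths by 1 to avoid dividing by zero; those
    # points are masked out below anyway.
    z = np.where(in_front, cam[:, 2], 1.0)
    uvw = (K @ cam.T).T
    u = uvw[:, 0] / z
    v = uvw[:, 1] / z
    height, width = image_shape
    valid = in_front & (u >= 0) & (u < width) & (v >= 0) & (v < height)
    pixels = np.stack([np.floor(u), np.floor(v)], axis=1).astype(int)
    return pixels, valid


def label_points(points, instance_mask, T_cam_lidar, K):
    """Give each LiDAR point the 2D instance id of the pixel it projects to.

    Points that do not project into the image receive the label -1.
    """
    pixels, valid = project_lidar_to_image(
        points, T_cam_lidar, K, instance_mask.shape
    )
    labels = np.full(points.shape[0], -1, dtype=int)
    cols, rows = pixels[valid, 0], pixels[valid, 1]
    labels[valid] = instance_mask[rows, cols]
    return labels
```

In a full pipeline, the per-point labels from successive frames would be merged incrementally into 3D instances; this sketch covers only the single-frame projection.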