Precise 3D environmental mapping is pivotal in robotics. Existing methods often rely on concepts predefined at training time or are slow to generate semantic maps. This paper presents Open-Fusion, an approach for real-time open-vocabulary 3D mapping and queryable scene representation from RGB-D data. Open-Fusion uses a pre-trained vision-language foundation model (VLFM) for open-set semantic comprehension and the Truncated Signed Distance Function (TSDF) for fast 3D scene reconstruction. From the VLFM, we extract region-based embeddings and their associated confidence maps, which are then integrated with 3D knowledge from the TSDF using an enhanced Hungarian-based feature-matching mechanism. Notably, Open-Fusion delivers annotation-free open-vocabulary 3D segmentation without requiring additional 3D training. In benchmarks on the ScanNet dataset, Open-Fusion outperforms leading zero-shot methods. Furthermore, it combines the strengths of region-based VLFM and TSDF, enabling real-time 3D scene comprehension that includes object concepts and open-world semantics. We encourage readers to view the demos on our project page: https://uark-aicv.github.io/OpenFusion
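To make the TSDF component concrete, the following is a minimal sketch of the standard weighted running-average TSDF update for a single voxel. It is not Open-Fusion's implementation: the names `Voxel`, `truncate`, and `integrate`, the truncation band `trunc`, and the cap `max_weight` are all hypothetical choices made for illustration, and the semantic embeddings that Open-Fusion associates with regions are not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Voxel:
    """One TSDF voxel (hypothetical layout, for illustration only)."""
    tsdf: float = 1.0    # normalized signed distance in [-1, 1]
    weight: float = 0.0  # accumulated observation weight


def truncate(sdf: float, trunc: float) -> float:
    """Scale a signed distance by the truncation band and clamp to [-1, 1]."""
    return max(-1.0, min(1.0, sdf / trunc))


def integrate(voxel: Voxel, sdf: float, trunc: float = 0.04,
              obs_weight: float = 1.0, max_weight: float = 64.0) -> Voxel:
    """Fuse one signed-distance observation into a voxel.

    `sdf` is the observed depth minus the voxel's depth along the camera ray,
    so positive values lie in front of the surface. The default parameter
    values are assumptions, not values taken from the paper.
    """
    if sdf < -trunc:
        # The voxel is well behind the observed surface (occluded),
        # so this frame carries no information about it.
        return voxel
    d = truncate(sdf, trunc)
    new_weight = voxel.weight + obs_weight
    # Weighted running average of the old value and the new observation.
    fused = (voxel.tsdf * voxel.weight + d * obs_weight) / new_weight
    # Capping the weight keeps the voxel responsive to later observations.
    return Voxel(tsdf=fused, weight=min(new_weight, max_weight))


# Usage: fuse three noisy observations of the same voxel.
v = Voxel()
for observed_sdf in (0.02, 0.01, 0.03):
    v = integrate(v, observed_sdf)
print(v)
```

Because each update touches one voxel and needs only its previous value and weight, this kind of fusion is incremental and parallelizes well across voxels, which is one reason TSDF-based reconstruction suits real-time use.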