ConceptFusion: Open-set Multimodal 3D Mapping

Krishna Murthy Jatavallabhula, Alihusein Kuwajerwala, Qiao Gu, Mohd Omama, Tao Chen, Shuang Li, Ganesh Iyer, Soroush Saryazdi, Nikhil Keetha, Ayush Tewari, Joshua B. Tenenbaum, Celso Miguel de Melo, Madhava Krishna, Liam Paull, Florian Shkurti, Antonio Torralba

Building 3D maps of the environment is central to robot navigation, planning, and interaction with objects in a scene. Most existing approaches that integrate semantic concepts with 3D maps remain confined to the closed-set setting: they can only reason about a finite set of concepts, pre-defined at training time. Further, these maps can only be queried using class labels or, in recent work, text prompts. We address both issues with ConceptFusion, a scene representation that is (i) fundamentally open-set, enabling reasoning beyond a closed set of concepts, and (ii) inherently multimodal, enabling a diverse range of possible queries to the 3D map, from language, to images, to audio, to 3D geometry, all working in concert. ConceptFusion leverages the open-set capabilities of today's foundation models, pre-trained on internet-scale data, to reason about concepts across modalities such as natural language, images, and audio. We demonstrate that pixel-aligned open-set features can be fused into 3D maps via traditional SLAM and multi-view fusion approaches. This enables effective zero-shot spatial reasoning without any additional training or fine-tuning, and retains long-tailed concepts better than supervised approaches, outperforming them by a margin of more than 40% on 3D IoU. We extensively evaluate ConceptFusion on a number of real-world datasets, simulated home environments, a real-world tabletop manipulation task, and an autonomous driving platform. We showcase new avenues for blending foundation models with 3D open-set multimodal mapping. For more information, visit our project page at https://concept-fusion.github.io or watch our 5-minute explainer video at https://www.youtube.com/watch?v=rkXgws8fiDs
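The idea of fusing per-pixel features into a 3D map and then querying that map with an embedding can be illustrated with a toy sketch. This is not the authors' implementation: the `FeatureMap` class, the weighted running-average update, the cosine-similarity scoring, and the three-dimensional hand-written vectors are all assumptions made for illustration. In practice the features would come from a foundation model and the point associations from a SLAM system.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))

class FeatureMap:
    """Hypothetical toy map: each point id stores a fused feature and a weight."""

    def __init__(self):
        self.features = {}  # point_id -> fused feature vector
        self.weights = {}   # point_id -> accumulated observation weight

    def fuse(self, point_id, feature, weight=1.0):
        """Fold one per-pixel observation into a point (weighted running average)."""
        if point_id not in self.features:
            self.features[point_id] = list(feature)
            self.weights[point_id] = weight
            return
        w_old = self.weights[point_id]
        w_new = w_old + weight
        old = self.features[point_id]
        self.features[point_id] = [
            (w_old * o + weight * f) / w_new for o, f in zip(old, feature)
        ]
        self.weights[point_id] = w_new

    def query(self, query_embedding):
        """Rank map points by similarity to a query embedding, best first."""
        scores = {
            pid: cosine(f, query_embedding) for pid, f in self.features.items()
        }
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Two views observe the same point "chair"; one view observes "lamp".
fmap = FeatureMap()
fmap.fuse("chair", [1.0, 0.0, 0.0])
fmap.fuse("chair", [0.8, 0.2, 0.0])
fmap.fuse("lamp", [0.0, 1.0, 0.0])

# The query vector stands in for an embedded text, image, or audio prompt.
ranked = fmap.query([1.0, 0.1, 0.0])
print(ranked[0][0])
```

Because the query is only an embedding in the same feature space, the same `query` call would serve any modality whose encoder maps into that space, which is the property the abstract describes as multimodal querying.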
