Humans excel at forming mental maps of their surroundings, which lets them understand object relationships and navigate in response to language queries. Our previous work, SI Maps [1], showed that combining instance-level information with a semantic understanding of the environment significantly improves performance on language-guided tasks. We extend this instance-level approach to 3D while making the pipeline more robust and improving both quantitative and qualitative results. Our method leverages foundation models for object recognition, image segmentation, and feature extraction. We propose a representation that yields a 3D point cloud map with instance-level embeddings, which supply the semantic understanding that natural language commands can query. Quantitatively, our approach improves the success rate on language-guided tasks. Qualitatively, it identifies instances more clearly and, by leveraging foundation models and language-image aligned embeddings, identifies objects that a closed-set approach would otherwise be unable to identify.
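To make the idea of querying instance-level embeddings concrete, the following is a minimal sketch and not the pipeline described in this work. It assumes that each map instance stores a set of 3D points and a single embedding vector, and that a text query has already been encoded into the same embedding space by a language-image aligned model. The function `query_instances`, the toy three-dimensional embeddings, and the use of cosine similarity with a centroid goal are all illustrative assumptions.

```python
import numpy as np


def query_instances(text_embedding, instance_embeddings, top_k=1):
    """Rank map instances by cosine similarity to a text embedding.

    Hypothetical helper: the ranking rule is an assumption made for
    illustration, not the method reported in the paper.
    """
    text = text_embedding / np.linalg.norm(text_embedding)
    inst = instance_embeddings / np.linalg.norm(
        instance_embeddings, axis=1, keepdims=True
    )
    scores = inst @ text  # cosine similarity per instance
    order = np.argsort(-scores)[:top_k]  # highest similarity first
    return [(int(i), float(scores[i])) for i in order]


# Toy map: instance ID -> 3D points belonging to that instance.
instance_points = {
    0: np.array([[0.0, 0.0, 0.0], [0.2, 0.0, 0.0]]),
    1: np.array([[2.0, 1.0, 0.0], [2.0, 3.0, 0.0]]),
    2: np.array([[5.0, 5.0, 1.0], [5.0, 7.0, 1.0]]),
}

# One embedding per instance (row i belongs to instance i).
# Real embeddings would come from a language-image aligned model.
instance_embeddings = np.array(
    [
        [1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.6, 0.8, 0.0],
    ]
)

# Stand-in for an encoded natural language command.
query_embedding = np.array([0.0, 2.0, 0.0])

ranked = query_instances(query_embedding, instance_embeddings, top_k=2)
best_id, best_score = ranked[0]

# Use the centroid of the best-matching instance as a navigation goal.
goal = instance_points[best_id].mean(axis=0)
```

Because every instance keeps its own embedding, two objects of the same category remain separately addressable, which is the property an instance-level map relies on; a per-point or per-cell embedding map would need an extra grouping step to recover it.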