We propose CLIP-Fields, an implicit scene model that can be used for a variety of tasks, such as segmentation, instance identification, semantic search over space, and view localization. CLIP-Fields learns a mapping from spatial locations to semantic embedding vectors. Importantly, we show that this mapping can be trained with supervision coming only from models trained on web images and web text, such as CLIP, Detic, and Sentence-BERT, and thus uses no direct human supervision. Compared to baselines like Mask-RCNN, our method performs better on few-shot instance identification and semantic segmentation on the HM3D dataset while using only a fraction of the examples. Finally, we show that, using CLIP-Fields as a scene memory, robots can perform semantic navigation in real-world environments. Our code and demonstration videos are available here: https://mahis.life/clip-fields
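To make the idea of semantic search over space concrete, the following is a minimal sketch and not the released implementation. It assumes the learned mapping has already been evaluated at a few 3D locations, and it uses short hand-written vectors as stand-ins for the semantic embeddings; in practice both the stored embeddings and the query embedding would come from models such as CLIP or Sentence-BERT. The names `cosine_similarity`, `semantic_search`, and `field_samples` are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search(query_embedding, field_samples):
    """Return the (location, score) whose embedding best matches the query.

    field_samples is a list of (location, embedding) pairs, standing in for
    the scene model evaluated at a set of 3D points (an assumption of this
    sketch, not the paper's interface).
    """
    scored = [
        (location, cosine_similarity(query_embedding, embedding))
        for location, embedding in field_samples
    ]
    return max(scored, key=lambda pair: pair[1])


# Placeholder data: three 3D locations with toy 4-dimensional embeddings.
field_samples = [
    ((0.0, 0.0, 0.0), [1.0, 0.0, 0.0, 0.0]),
    ((1.0, 2.0, 0.5), [0.0, 1.0, 0.0, 0.0]),
    ((3.0, 1.0, 0.0), [0.0, 0.0, 1.0, 0.0]),
]

# Placeholder query; a real query would be the embedding of a text prompt.
query_embedding = [0.1, 0.9, 0.0, 0.0]

location, score = semantic_search(query_embedding, field_samples)
```

In this toy setup the query is closest to the second stored embedding, so the search returns that sample's location, which a robot could then treat as a navigation goal.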