We present Visual-Language Fields (VL-Fields), a neural implicit spatial representation that enables open-vocabulary semantic queries. Our model encodes and fuses the geometry of a scene with vision-language trained latent features by distilling information from a language-driven segmentation model. VL-Fields is trained without any prior knowledge of the scene's object classes, which makes it a promising representation for robotics. On the task of semantic segmentation, our model outperforms the comparable CLIP-Fields model by almost 10%.
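To make the idea of an open-vocabulary semantic query concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the field returns one latent feature vector per queried 3-D point, and that these features lie in the same embedding space as the outputs of a text encoder. The function names (`cosine`, `query_labels`) and the toy 3-dimensional vectors are hypothetical; real embeddings would have hundreds of dimensions and come from trained networks.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def query_labels(point_features, text_embeddings):
    """Assign each point the query whose text embedding is most similar.

    point_features: list of feature vectors, one per 3-D point
                    (assumed to be produced by the field).
    text_embeddings: dict mapping a free-form text query to its embedding
                     (assumed to be produced by a text encoder).
    """
    labels = []
    for feat in point_features:
        best = max(text_embeddings,
                   key=lambda name: cosine(feat, text_embeddings[name]))
        labels.append(best)
    return labels


# Toy stand-ins for text-encoder outputs (hypothetical values).
text_embeddings = {
    "chair": [1.0, 0.0, 0.0],
    "table": [0.0, 1.0, 0.0],
}

# Toy stand-ins for per-point features returned by the field (hypothetical values).
point_features = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
]

labels = query_labels(point_features, text_embeddings)
print(labels)
```

Because the label set is just a dictionary of text queries, it can be changed at query time without retraining, which is what makes the query open-vocabulary under these assumptions.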