Object-oriented embodied navigation aims to locate specific objects, defined by category or depicted in images. Existing methods often struggle to generalize to open-vocabulary goals without extensive training data. While recent advances in Vision-Language Models (VLMs) offer a promising solution by extending object recognition beyond predefined categories, efficient goal-oriented exploration becomes more challenging in an open-vocabulary setting. We introduce OVExp, a learning-based framework that integrates VLMs for Open-Vocabulary Exploration. OVExp constructs scene representations by encoding observations with VLMs and projecting them onto top-down maps for goal-conditioned exploration. Goals are encoded in the same VLM feature space, and a lightweight transformer-based decoder predicts target locations while preserving versatile representation ability. Because fusing dense pixel embeddings with full 3D scene reconstruction is impractical for training, we instead construct maps from low-cost semantic categories and transform them into CLIP's embedding space via the text encoder. This simple yet effective design significantly reduces computational costs and demonstrates strong generalization across various navigation settings. Experiments on established benchmarks show that OVExp outperforms previous zero-shot methods, generalizes to diverse scenes, and handles different goal modalities.
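The minimal Python sketch below illustrates this core idea under stated assumptions; it is not the authors' released code. Semantic category labels in a top-down map are lifted into CLIP's text-embedding space, the goal is encoded in the same space, and a lightweight transformer decoder scores map cells as candidate target locations. Names such as `build_clip_feature_map` and `GoalConditionedDecoder` are hypothetical placeholders; the only external API assumed is the OpenAI `clip` package together with PyTorch.

```python
# Hedged sketch of the OVExp-style pipeline (illustrative, not the official code):
# lift a cheap top-down semantic-category map into CLIP's text-embedding space,
# encode the goal in the same space, and let a small transformer decoder score
# map cells as candidate long-term goal locations.
import torch
import torch.nn as nn
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git


def build_clip_feature_map(category_map, category_names, clip_model, device="cpu"):
    """category_map: (H, W) long tensor of category indices.
    Returns (H*W, D) CLIP text embeddings, one per map cell."""
    tokens = clip.tokenize(category_names).to(device)
    with torch.no_grad():
        cat_emb = clip_model.encode_text(tokens).float()           # (C, D)
    cat_emb = cat_emb / cat_emb.norm(dim=-1, keepdim=True)         # unit-normalize
    return cat_emb[category_map.reshape(-1).to(device)]            # (H*W, D)


class GoalConditionedDecoder(nn.Module):
    """Lightweight transformer decoder: map-cell features attend to the goal
    embedding, and a linear head scores each cell as the predicted target."""

    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=layers)
        self.score = nn.Linear(dim, 1)

    def forward(self, map_feats, goal_emb):
        # map_feats: (B, H*W, D) scene memory; goal_emb: (B, 1, D) query.
        fused = self.decoder(tgt=map_feats, memory=goal_emb)       # goal-conditioned cells
        return self.score(fused).squeeze(-1)                       # (B, H*W) location logits


if __name__ == "__main__":
    device = "cpu"
    model, _ = clip.load("ViT-B/32", device=device)
    names = ["wall", "floor", "chair", "bed", "sofa"]              # low-cost semantic categories
    sem_map = torch.randint(0, len(names), (24, 24))               # toy 24x24 top-down map
    map_feats = build_clip_feature_map(sem_map, names, model, device).unsqueeze(0)

    goal = clip.tokenize(["a bed"]).to(device)                     # goal encoded in the same space
    with torch.no_grad():
        goal_emb = model.encode_text(goal).float()
    goal_emb = (goal_emb / goal_emb.norm(dim=-1, keepdim=True)).unsqueeze(1)

    logits = GoalConditionedDecoder(dim=map_feats.shape[-1])(map_feats, goal_emb)
    target_cell = logits.argmax(dim=-1)                            # predicted long-term goal cell
```

Because both the map and the goal live in CLIP's embedding space, the same decoder can in principle serve different goal modalities; an image goal would simply be embedded with CLIP's image encoder (`model.encode_image`) instead of the text encoder.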