The ability to accurately locate and navigate to a specific object is a crucial capability for embodied agents that operate in the real world and interact with objects to complete tasks. Such object navigation tasks usually require large-scale training in visual environments with labeled objects, and the resulting agents generalize poorly to novel objects in unknown environments. In this work, we present a novel zero-shot object navigation method, Exploration with Soft Commonsense constraints (ESC), that transfers commonsense knowledge in pre-trained models to open-world object navigation without any navigation experience or any other training on the visual environments. First, ESC leverages a pre-trained vision-and-language model for open-world prompt-based grounding and a pre-trained commonsense language model for room and object reasoning. Then ESC converts commonsense knowledge into navigation actions by modeling it as soft logic predicates for efficient exploration. Extensive experiments on the MP3D, HM3D, and RoboTHOR benchmarks show that ESC improves significantly over baselines and achieves new state-of-the-art results for zero-shot object navigation (e.g., a 288% relative Success Rate improvement over CoW on MP3D).
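To make the idea of soft commonsense constraints concrete, the following is a minimal sketch of how commonsense priors might bias frontier selection during exploration. It is an illustration under stated assumptions, not the ESC implementation: the `Frontier` class, the `soft_score` and `choose_frontier` functions, the weights, and the prior values are all hypothetical, and a plain weighted sum stands in for soft-logic inference. In a real system the priors would come from a commonsense language model and the nearby objects and rooms from the grounding model.

```python
from dataclasses import dataclass, field


@dataclass
class Frontier:
    """A candidate exploration target (hypothetical structure)."""
    name: str
    distance: float  # travel cost from the agent, in arbitrary units
    nearby_objects: list = field(default_factory=list)
    nearby_rooms: list = field(default_factory=list)


def soft_score(frontier, goal, object_prior, room_prior,
               w_object=1.0, w_room=1.0, w_distance=0.1):
    """Weighted sum of soft truth values in [0, 1], minus a distance penalty.

    Each prior maps (goal, context) to a made-up plausibility that the goal
    is found near that context. Unknown pairs count as 0.0.
    """
    obj = max((object_prior.get((goal, o), 0.0) for o in frontier.nearby_objects),
              default=0.0)
    room = max((room_prior.get((goal, r), 0.0) for r in frontier.nearby_rooms),
               default=0.0)
    return w_object * obj + w_room * room - w_distance * frontier.distance


def choose_frontier(frontiers, goal, object_prior, room_prior):
    """Pick the frontier with the highest soft score."""
    return max(frontiers,
               key=lambda f: soft_score(f, goal, object_prior, room_prior))


# Illustrative priors only; these numbers are assumptions, not model outputs.
object_prior = {("tv", "sofa"): 0.9, ("tv", "sink"): 0.1}
room_prior = {("tv", "living room"): 0.8, ("tv", "bathroom"): 0.05}

frontiers = [
    Frontier("A", distance=2.0, nearby_objects=["sink"], nearby_rooms=["bathroom"]),
    Frontier("B", distance=4.0, nearby_objects=["sofa"], nearby_rooms=["living room"]),
    Frontier("C", distance=1.0),  # nothing recognized nearby
]

best = choose_frontier(frontiers, "tv", object_prior, room_prior)
print(best.name)
```

Because the constraints are soft, a farther frontier with strong commonsense support (here `B`) can outrank a closer one with weak or no support, while the distance term still keeps exploration from ignoring travel cost.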