The task of Visual Object Navigation (VON) requires an agent to locate a particular object within a given scene. To accomplish the VON task, two essential conditions must be fulfilled: 1) the user must know the name of the desired object; and 2) the user-specified object must actually be present within the scene. To meet these conditions, a simulator can incorporate predefined object names and positions into the metadata of the scene. However, in real-world scenarios, it is often challenging to ensure that these conditions are always met. A human in an unfamiliar environment may not know which objects are present in the scene, or may mistakenly specify an object that is not actually present. Despite these challenges, the human may still have a demand for an object, and that demand could potentially be fulfilled in an equivalent manner by other objects present within the scene. Hence, we propose Demand-driven Navigation (DDN), which uses the user's demand as the task instruction and prompts the agent to find an object that matches the specified demand. DDN aims to relax the stringent conditions of VON by focusing on fulfilling the user's demand rather than relying solely on predefined object categories or names. We propose a method that first acquires textual attribute features of objects by extracting common knowledge from a large language model. These textual attribute features are subsequently aligned with visual attribute features using Contrastive Language-Image Pre-training (CLIP). By incorporating the visual attribute features as prior knowledge, we enhance the navigation process. Experiments on AI2Thor with the ProcThor dataset demonstrate that the visual attribute features improve the agent's navigation performance and that the method outperforms the baseline methods commonly used in VON.
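The alignment of textual and visual attribute features mentioned above can be illustrated with a CLIP-style symmetric contrastive loss. The following is only a minimal sketch under stated assumptions, not the method's actual training code: the toy vectors stand in for LLM-derived textual attribute embeddings and CLIP image embeddings, the function names are hypothetical, and the temperature of 0.07 is an assumed value.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_logits(text_feats, visual_feats, temperature=0.07):
    """Pairwise cosine similarities divided by a temperature (assumed value)."""
    text_n = [l2_normalize(v) for v in text_feats]
    vis_n = [l2_normalize(v) for v in visual_feats]
    return [
        [sum(a * b for a, b in zip(t, v)) / temperature for v in vis_n]
        for t in text_n
    ]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum - row[k]
    return total / len(logits)


def symmetric_contrastive_loss(text_feats, visual_feats, temperature=0.07):
    """Average of the text-to-visual and visual-to-text contrastive losses."""
    logits = cosine_logits(text_feats, visual_feats, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


# Toy stand-ins (hypothetical): row k of each list describes the same object.
text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
visual_aligned = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
# Rotating the rows breaks the pairing, so the loss should increase.
visual_shuffled = visual_aligned[1:] + visual_aligned[:1]

aligned_loss = symmetric_contrastive_loss(text_feats, visual_aligned)
shuffled_loss = symmetric_contrastive_loss(text_feats, visual_shuffled)
print("aligned pairs loss:", aligned_loss)
print("shuffled pairs loss:", shuffled_loss)
```

In this toy setting the loss is lower when each textual attribute vector is paired with the visual vector of the same object, which is the property a contrastive alignment objective is meant to encourage.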