Vision-and-Language Navigation (VLN) requires an agent to navigate to a target location by following a given instruction. Unlike existing methods, which focus on predicting a more accurate action at each navigation step, this paper makes the first attempt to tackle a long-ignored problem in VLN: narrowing the gap between Success Rate (SR) and Oracle Success Rate (OSR). We observe a consistently large gap (up to 9%) for four state-of-the-art VLN methods across two benchmark datasets, R2R and REVERIE. A high OSR indicates that the agent passes the target location, whereas a low SR shows that it fails to stop there at the end of its trajectory. Instead of predicting actions directly, we propose to mine the target location from a trajectory produced by an off-the-shelf VLN model. Specifically, we design a multi-module transformer-based model that learns a compact, discriminative representation of each trajectory viewpoint, which is then used to predict the confidence that the viewpoint is the target location described in the instruction. The proposed method is evaluated on three widely adopted datasets, R2R, REVERIE and NDH, and shows promising results, demonstrating its potential for further research.
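To make the SR/OSR gap concrete, the following is a minimal sketch of how the two metrics differ and how re-selecting the stop viewpoint from an existing trajectory can close the gap. It is not the paper's implementation. Several things here are assumptions made for illustration: viewpoints are plain 2D coordinates, distance is Euclidean rather than shortest-path distance on a navigation graph, success uses a 3 m radius (the threshold commonly used for R2R), and the per-viewpoint confidences are hand-written stand-ins for the output of a learned model. All function names are hypothetical.

```python
from math import dist

SUCCESS_RADIUS = 3.0  # metres; assumed threshold, as commonly used for R2R


def success(trajectory, goal, radius=SUCCESS_RADIUS):
    """SR criterion: the agent's final viewpoint is within `radius` of the goal."""
    return dist(trajectory[-1], goal) <= radius


def oracle_success(trajectory, goal, radius=SUCCESS_RADIUS):
    """OSR criterion: any viewpoint on the trajectory is within `radius` of the goal."""
    return any(dist(p, goal) <= radius for p in trajectory)


def rate(flags):
    """Fraction of episodes for which the flag is True."""
    flags = list(flags)
    return sum(flags) / len(flags)


def mine_stop(trajectory, confidences):
    """Truncate the trajectory at the viewpoint with the highest confidence.

    `confidences` stands in for the scores a trained model would assign to
    each viewpoint; here they are supplied by hand (illustrative only).
    """
    best = max(range(len(trajectory)), key=lambda i: confidences[i])
    return trajectory[: best + 1]


# Toy episodes: (trajectory, goal). Values are made up for illustration.
episodes = [
    ([(0, 0), (4, 0), (8, 0)], (8, 0)),            # stops at the goal
    ([(0, 0), (4, 0), (8, 0), (12, 0)], (8, 0)),   # passes the goal, overshoots
    ([(0, 0), (0, 4)], (10, 10)),                  # never gets close
]

sr = rate(success(t, g) for t, g in episodes)
osr = rate(oracle_success(t, g) for t, g in episodes)
print(f"SR={sr:.2f}  OSR={osr:.2f}  gap={osr - sr:.2f}")

# Re-select the stop point for the overshooting episode.
overshoot_traj, overshoot_goal = episodes[1]
mined = mine_stop(overshoot_traj, confidences=[0.1, 0.2, 0.9, 0.4])
print("success after mining:", success(mined, overshoot_goal))
```

In this toy setup the second episode counts toward OSR but not SR, which is exactly the case the abstract describes. Choosing the stop viewpoint after the fact converts it into a success without changing the navigation policy; how well this works in practice depends entirely on the quality of the confidence scores.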