Language understanding is essential for a navigation agent to follow instructions. We observe two kinds of issues in instructions that make the navigation task challenging: (1) the mentioned landmarks are not recognizable by the navigation agent, because the instructor and the modeled agent differ in visual ability; (2) the mentioned landmarks apply to multiple targets and are therefore not distinctive for selecting the target among the candidate viewpoints. To address these issues, we design a translator module for the navigation agent that converts the original instructions into easy-to-follow sub-instruction representations at each step. The translator needs to focus on landmarks that are recognizable and distinctive given the agent's visual abilities and the observed visual environment. To achieve this goal, we create a new synthetic sub-instruction dataset and design specific tasks to train the translator and the navigation agent. We evaluate our approach on the Room2Room (R2R), Room4Room (R4R), and Room2Room Last (R2R-Last) datasets and achieve state-of-the-art results on multiple benchmarks.