Vision-and-language navigation (VLN) simulates a visual agent that follows natural-language navigation instructions in real-world scenes. Existing approaches, such as beam search, pre-exploration, and dynamic or hierarchical history encoding, have made enormous progress in navigating new environments. To balance generalization and efficiency, we memorize visited scenarios in addition to the ongoing route while navigating. In this work, we introduce Episodic Scene memory (ESceme), a mechanism for VLN that wakes an agent's memories of past visits when it enters the current scene. The episodic scene memory allows the agent to envision a bigger picture for its next prediction. This way, the agent learns to utilize dynamically updated information instead of merely adapting to static observations. We provide a simple yet effective implementation of ESceme that enhances the accessible views at each location and progressively completes the memory while navigating. We verify the superiority of ESceme on short-horizon (R2R), long-horizon (R4R), and vision-and-dialog (CVDN) VLN tasks. ESceme also wins first place on the CVDN leaderboard. Code is available at \url{https://github.com/qizhust/esceme}.}