Vision-and-language navigation in continuous environments (VLN-CE) is a recently released embodied task in which AI agents must follow language instructions to navigate a freely traversable environment and reach a distant target location. The task is highly challenging because the space of possible strategies is huge. Driven by the belief that the ability to anticipate the consequences of future actions is crucial for the emergence of intelligent and interpretable planning behavior, we propose DREAMWALKER, a world-model-based VLN-CE agent. The world model summarizes the visual, topological, and dynamic properties of the complicated continuous environment in a discrete, structured, and compact representation. DREAMWALKER can simulate and evaluate possible plans entirely within this internal abstract world before executing costly actions. Unlike existing model-free VLN-CE agents, which simply make greedy decisions in the real world and are therefore prone to shortsighted behavior, DREAMWALKER plans strategically through large numbers of ``mental experiments.'' Moreover, the imagined future scenarios reflect the agent's intention, making its decision-making process more transparent. Extensive experiments and ablation studies on the VLN-CE dataset confirm the effectiveness of the proposed approach and outline fruitful directions for future work.