Sequential-Horizon Vision-and-Language Navigation (SH-VLN) presents a challenging scenario in which agents must sequentially execute multi-task navigation guided by complex, long-horizon language instructions. Current vision-and-language navigation models exhibit significant performance degradation on such multi-task instructions, as information overload impairs the agent's ability to attend to observationally relevant details. To address this problem, we propose SeqWalker, a navigation model built on a hierarchical planning framework. SeqWalker features: i) a High-Level Planner that dynamically decomposes global instructions into contextually relevant sub-instructions based on the agent's current visual observations, thereby reducing cognitive load; ii) a Low-Level Planner incorporating an Exploration-Verification strategy that leverages the inherent logical structure of instructions for trajectory error correction. To evaluate SH-VLN performance, we also extend the IVLN dataset and establish a new benchmark. Extensive experiments demonstrate the superiority of the proposed SeqWalker.