Vision-Language Navigation (VLN) requires agents to follow natural language instructions in partially observed 3D environments, motivating map representations that aggregate spatial context beyond local perception. However, most existing approaches rely on hand-crafted maps constructed independently of the navigation policy. We argue that maps should instead be learned representations shaped directly by navigation objectives, rather than exhaustive reconstructions of the environment. Based on this insight, we propose MapDream, a map-in-the-loop framework that formulates map construction as autoregressive bird's-eye-view (BEV) image synthesis. The framework jointly learns map generation and action prediction, distilling environmental context into a compact three-channel BEV map that preserves only navigation-critical affordances. Supervised pre-training bootstraps a reliable mapping-to-control interface, while the autoregressive design enables end-to-end joint optimization through reinforcement fine-tuning. On R2R-CE and RxR-CE, MapDream achieves state-of-the-art performance among monocular methods, validating task-driven generative map learning.
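The map-in-the-loop idea above can be sketched as a rollout in which a compact three-channel BEV map is regenerated autoregressively at each step and the action is predicted from that map. The following is a minimal toy sketch, not the authors' architecture: all module names, shapes, feature dimensions, and the four-way action space are illustrative assumptions, and the linear updates stand in for learned networks.

```python
# Toy sketch of a map-in-the-loop rollout: map generation and action
# prediction share one loop, so both could be optimized jointly end-to-end.
# Every shape and operation here is an illustrative assumption.
import numpy as np

H = W = 16          # toy BEV resolution (assumed)
N_ACTIONS = 4       # e.g. forward / left / right / stop (assumed)
INSTR_DIM = 8       # toy instruction-embedding size (assumed)

rng = np.random.default_rng(0)
W_a = rng.standard_normal((N_ACTIONS, 3 + INSTR_DIM))  # toy policy weights

def generate_bev(prev_map, obs_feat):
    """Autoregressive BEV synthesis: condition the new three-channel map on
    the previous map and the current observation (toy exponential update)."""
    update = np.tanh(obs_feat.reshape(1, 1, 3))        # broadcast over H, W
    return 0.9 * prev_map + 0.1 * update               # keep values bounded

def predict_action(bev_map, instr_feat):
    """Policy head: pool the map, fuse with the instruction, pick an action."""
    pooled = bev_map.mean(axis=(0, 1))                 # (3,) map summary
    return int(np.argmax(W_a @ np.concatenate([pooled, instr_feat])))

bev = np.zeros((H, W, 3))                # compact three-channel BEV map
instr_feat = rng.standard_normal(INSTR_DIM)  # fixed instruction embedding
trajectory = []
for t in range(5):
    obs_feat = rng.standard_normal(3)    # toy monocular observation feature
    bev = generate_bev(bev, obs_feat)    # map is regenerated in the loop
    trajectory.append(predict_action(bev, instr_feat))

print(trajectory)
```

Because the map only ever exists inside this loop, a navigation reward can in principle shape what the map encodes, which is the property the abstract attributes to the autoregressive design.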