Vision-Language Navigation (VLN) requires agents to follow natural-language instructions in partially observed 3D environments, motivating map representations that aggregate spatial context beyond local perception. However, most existing approaches rely on hand-crafted maps constructed independently of the navigation policy. We argue that maps should instead be learned representations shaped directly by navigation objectives, rather than exhaustive reconstructions. Based on this insight, we propose MapDream, a map-in-the-loop framework that formulates map construction as autoregressive bird's-eye-view (BEV) image synthesis. The framework jointly learns map generation and action prediction, distilling environmental context into a compact three-channel BEV map that preserves only navigation-critical affordances. Supervised pre-training bootstraps a reliable mapping-to-control interface, while the autoregressive design enables end-to-end joint optimization through reinforcement fine-tuning. On R2R-CE and RxR-CE, MapDream achieves state-of-the-art performance among monocular methods, validating task-driven generative map learning.