WAM-Nav: Asymmetric Latent World-Action Modeling for Unified Visual Navigation

Visual navigation requires generating smooth and collision-free trajectories under complex geometric and physical constraints. Existing reactive policies that directly map observations to actions lack anticipatory reasoning, limiting their ability to proactively avoid obstacles. While visual imagination offers predictive foresight, conventional modular approaches separate scene prediction from policy learning, often leading to error accumulation and inefficient inference. To address these limitations, we propose WAM-Nav, a Latent World-Action Model for embodied visual navigation that jointly learns action generation and latent visual foresight, enabling more robust and foresighted navigation decisions without compromising inference efficiency. Specifically, WAM-Nav utilizes a shared Diffusion Transformer for asymmetric joint diffusion to concurrently generate long-horizon actions and short-horizon visual foresight, reducing the inference latency and visual error accumulation inherent in multi-step autoregressive rollouts. To further encourage smooth and consistent trajectory generation, we introduce a dual-stream contextual conditioning mechanism that integrates episode-level ego-motion history with sequential visual observations. Combined with a unified goal alignment module that preserves balanced representations across goal types, WAM-Nav naturally supports Image-Goal, Point-Goal, and No-Goal exploration within a single policy. Extensive experiments on the challenging ClutterScenes and InternScenes benchmarks demonstrate strong generalization of WAM-Nav, particularly on Image-Goal and Point-Goal navigation, where it improves success rates by 15.7% and 3.3%, respectively. Real-world deployment further validates effective zero-shot sim-to-real transfer, achieving an average 85% task success rate across diverse indoor and outdoor environments.
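To make the asymmetric joint diffusion idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that action tokens and latent foresight tokens are projected to a common width and concatenated into one sequence, which a shared denoiser refines over a fixed number of steps. The horizon lengths, token width, step count, and the names `toy_shared_denoiser` and `joint_sample` are all hypothetical. The toy denoiser and the deterministic refinement loop merely stand in for a Diffusion Transformer and a real diffusion sampler.

```python
import random

# All sizes below are illustrative assumptions, not values from the paper.
ACTION_HORIZON = 16     # long horizon: number of future action tokens
FORESIGHT_HORIZON = 4   # short horizon: number of future latent frames
TOKEN_DIM = 4           # shared token width after a (hypothetical) projection
NUM_DENOISE_STEPS = 8   # number of sampler steps


def toy_shared_denoiser(tokens, context):
    """Stand-in for the shared Diffusion Transformer.

    Receives the whole joint sequence at once and returns a predicted clean
    value for every token. A real model would attend across all tokens;
    this toy simply predicts the context vector at each position.
    """
    return [list(context) for _ in tokens]


def joint_sample(context, rng):
    # Both modalities start from Gaussian noise in ONE sequence.
    length = ACTION_HORIZON + FORESIGHT_HORIZON
    tokens = [[rng.gauss(0.0, 1.0) for _ in range(TOKEN_DIM)]
              for _ in range(length)]
    calls = 0
    for step in range(NUM_DENOISE_STEPS):
        predicted_clean = toy_shared_denoiser(tokens, context)
        calls += 1
        remaining = NUM_DENOISE_STEPS - step
        # Move each token a fraction of the way toward the prediction;
        # on the last step (remaining == 1) it lands on the prediction.
        tokens = [
            [x + (p - x) / remaining for x, p in zip(tok, pred)]
            for tok, pred in zip(tokens, predicted_clean)
        ]
    # Split the joint sequence back into its two asymmetric parts.
    actions = tokens[:ACTION_HORIZON]
    foresight = tokens[ACTION_HORIZON:]
    return actions, foresight, calls


actions, foresight, calls = joint_sample([0.5] * TOKEN_DIM, random.Random(0))
print(len(actions), len(foresight), calls)  # prints: 16 4 8
```

In this sketch the number of denoiser calls equals the number of sampler steps and does not grow with either horizon, whereas an autoregressive rollout would need at least one sequential prediction per future frame.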
