WAM-Nav: Asymmetric Latent World-Action Modeling for Unified Visual Navigation

Visual navigation requires generating smooth and collision-free trajectories under complex geometric and physical constraints. Existing reactive policies that directly map observations to actions lack anticipatory reasoning, limiting their ability to proactively avoid obstacles. While visual imagination offers predictive foresight, conventional modular approaches separate scene prediction from policy learning, often leading to error accumulation and inefficient inference. To address these limitations, we propose WAM-Nav, a Latent World-Action Model for embodied visual navigation that jointly learns action generation and latent visual foresight, enabling more robust and foresighted navigation decisions without compromising inference efficiency. Specifically, WAM-Nav utilizes a shared Diffusion Transformer for asymmetric joint diffusion to concurrently generate long-horizon actions and short-horizon visual foresight, reducing the inference latency and visual error accumulation inherent in multi-step autoregressive rollouts. To further encourage smooth and consistent trajectory generation, we introduce a dual-stream contextual conditioning mechanism that integrates episode-level ego-motion history with sequential visual observations. Combined with a unified goal alignment module that preserves balanced representations across goal types, WAM-Nav naturally supports Image-Goal, Point-Goal, and No-Goal exploration within a single policy. Extensive experiments on the challenging ClutterScenes and InternScenes benchmarks demonstrate strong generalization of WAM-Nav, particularly on Image-Goal and Point-Goal navigation, where it improves success rates by 15.7% and 3.3%, respectively. Real-world deployment further validates effective zero-shot sim-to-real transfer, achieving an average 85% task success rate across diverse indoor and outdoor environments.
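The asymmetric joint diffusion described above can be pictured as a single sampling loop that refines a long sequence of action tokens and a short sequence of latent visual tokens in the same pass, instead of predicting frames one at a time. The sketch below is only a toy illustration of that loop structure. The horizons, token widths, step count, the `toy_shared_denoiser` stand-in, and the simple linear update rule are all assumptions made here for illustration, not details taken from WAM-Nav.

```python
import random

# All sizes below are illustrative assumptions, not values from the paper.
ACTION_HORIZON = 16  # long-horizon action tokens (e.g. future waypoints)
VISION_HORIZON = 4   # short-horizon latent visual tokens
ACTION_DIM = 2
LATENT_DIM = 8
NUM_STEPS = 10


def gaussian_tokens(count, dim, rng):
    """Sample `count` Gaussian noise tokens, each of width `dim`."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


def toy_shared_denoiser(actions, latents, goal_value):
    """Hypothetical placeholder for the shared Diffusion Transformer.

    A real model would attend over both token streams and the conditioning.
    This stand-in only uses the token shapes and predicts `goal_value` for
    every clean coordinate, so the loop below is easy to check.
    """
    clean_actions = [[goal_value] * len(tok) for tok in actions]
    clean_latents = [[goal_value] * len(tok) for tok in latents]
    return clean_actions, clean_latents


def step_toward(tokens, clean, remaining):
    """Move each coordinate 1/remaining of the way to its predicted clean value."""
    return [[x + (c - x) / remaining for x, c in zip(tok, ctok)]
            for tok, ctok in zip(tokens, clean)]


def joint_sample(goal_value, seed=0):
    """Denoise both streams together; the stream lengths differ (asymmetric)."""
    rng = random.Random(seed)
    actions = gaussian_tokens(ACTION_HORIZON, ACTION_DIM, rng)
    latents = gaussian_tokens(VISION_HORIZON, LATENT_DIM, rng)
    for remaining in range(NUM_STEPS, 0, -1):
        # One denoiser call per step refines actions and latents jointly.
        clean_a, clean_l = toy_shared_denoiser(actions, latents, goal_value)
        actions = step_toward(actions, clean_a, remaining)
        latents = step_toward(latents, clean_l, remaining)
    return actions, latents


if __name__ == "__main__":
    acts, lats = joint_sample(goal_value=0.5)
    print(len(acts), "action tokens,", len(lats), "latent visual tokens")
```

The point of the sketch is the cost structure: the number of denoiser calls is set by `NUM_STEPS` and does not grow with the number of predicted frames, which is the contrast with a multi-step autoregressive rollout that the abstract draws.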
