Embodied navigation in open, dynamic environments demands accurate foresight of how the world will evolve and how actions will unfold over time. We propose AstraNav-World, an end-to-end world model that jointly reasons about future visual states and action sequences within a unified probabilistic framework. The framework integrates a diffusion-based video generator with a vision-language policy, enabling synchronized rollouts in which predicted scenes and planned actions are updated simultaneously. Training optimizes two complementary objectives: generating action-conditioned multi-step visual predictions, and deriving trajectories conditioned on those predicted visuals. This bidirectional constraint makes visual predictions executable and keeps decisions grounded in physically consistent, task-relevant futures, mitigating the cumulative errors common in decoupled "envision-then-plan" pipelines. Experiments across diverse embodied navigation benchmarks show improved trajectory accuracy and higher success rates. Ablations confirm the necessity of tight vision-action coupling and unified training: removing either branch degrades both prediction quality and policy reliability. In real-world testing, AstraNav-World demonstrates strong zero-shot capabilities, adapting to previously unseen scenarios without any real-world fine-tuning. These results suggest that AstraNav-World captures transferable spatial understanding and planning-relevant navigation dynamics rather than merely overfitting to simulation-specific data distributions. Overall, by unifying visual foresight and control within a single generative model, we move closer to reliable, interpretable, and general-purpose embodied agents that operate robustly in open-ended real-world settings.