AstraNav-World: World Model for Foresight Control and Consistency

Jintao Chen, Junjun Hu, Haochen Bai, Minghua Luo, Xinda Xue, Botao Ren, Chengyu Bai, Shichao Xie, Ziyi Chen, Fei Liu, Zedong Chu, Xiaolong Wu, Mu Xu, Shanghang Zhang

Embodied navigation in open, dynamic environments demands accurate foresight of how the world will evolve and how actions will unfold over time. We propose AstraNav-World, an end-to-end world model that jointly reasons about future visual states and action sequences within a unified probabilistic framework. The framework integrates a diffusion-based video generator with a vision-language policy, enabling synchronized rollouts in which predicted scenes and planned actions are updated simultaneously. Training optimizes two complementary objectives: generating action-conditioned multi-step visual predictions, and deriving trajectories conditioned on those predicted visuals. This bidirectional constraint makes visual predictions executable and keeps decisions grounded in physically consistent, task-relevant futures, mitigating the cumulative errors common in decoupled "envision-then-plan" pipelines. Experiments across diverse embodied navigation benchmarks show improved trajectory accuracy and higher success rates. Ablations confirm the necessity of tight vision-action coupling and unified training: removing either branch degrades both prediction quality and policy reliability. In real-world testing, AstraNav-World demonstrates exceptional zero-shot capability, adapting to previously unseen scenarios without any real-world fine-tuning. These results suggest that AstraNav-World captures transferable spatial understanding and planning-relevant navigation dynamics, rather than merely overfitting to a simulation-specific data distribution. Overall, by unifying visual foresight and control within a single generative model, we move closer to reliable, interpretable, and general-purpose embodied agents that operate robustly in open-ended real-world settings.
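
One way to read the two complementary training objectives is as a weighted sum of a visual-prediction loss and an action loss. The block below is only an illustrative sketch under stated assumptions: the symbols (observations o, actions a, goal or instruction g, horizon H, per-branch losses \ell_vis and \ell_act, and weight \lambda) are introduced here for exposition and are not notation or loss definitions taken from the paper, which this abstract does not specify.

```latex
% Hypothetical joint objective; every symbol here is an assumption made for illustration.
\begin{align}
\hat{o}_{t+1:t+H} &\sim p_\theta\big(\cdot \mid o_{\le t},\, a_{t:t+H-1},\, g\big)
  && \text{(action-conditioned visual prediction)} \\
\hat{a}_{t:t+H-1} &\sim \pi_\theta\big(\cdot \mid o_{\le t},\, \hat{o}_{t+1:t+H},\, g\big)
  && \text{(trajectory conditioned on predicted visuals)} \\
\mathcal{L}(\theta) &= \mathbb{E}\big[\ell_{\mathrm{vis}}\big(\hat{o}_{t+1:t+H},\, o_{t+1:t+H}\big)\big]
  + \lambda\, \mathbb{E}\big[\ell_{\mathrm{act}}\big(\hat{a}_{t:t+H-1},\, a_{t:t+H-1}\big)\big]
\end{align}
```

Under this reading, the shared parameters \theta are what couple the two branches: the visual term is trained on futures that are tied to actions, and the action term is trained on trajectories that are tied to the predicted futures, which is one plausible interpretation of the "bidirectional constraint" described above.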
