Navigation is a fundamental capability for mobile robots. While the current trend is to replace traditional geometry-based methods with learning-based approaches, existing end-to-end learned policies often struggle with 3D spatial reasoning and lack a comprehensive understanding of physical world dynamics. Integrating world models, which predict future observations conditioned on given actions, with iterative optimization-based planning offers a promising solution: the world model can imagine the outcomes of candidate actions, and the planner can be adapted flexibly to different tasks. However, current navigation world models, typically built on pure transformer architectures, often rely on multi-step diffusion processes and autoregressive frame-by-frame generation. These mechanisms incur prohibitive computational latency, which prevents real-time deployment. To address this bottleneck, we propose a lightweight navigation world model that adopts a one-step generation paradigm and a 3D U-Net backbone equipped with efficient spatial-temporal attention. This design drastically reduces inference latency, enabling high-frequency control while achieving superior predictive performance. We also integrate this model into an optimization-based planning framework that uses anchor-based initialization to handle multi-modal goal navigation tasks. Extensive closed-loop experiments in both simulation and real-world environments demonstrate our system's superior efficiency and robustness compared to state-of-the-art baselines.
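As an illustration of the planning idea described above, the following is a minimal sketch, not the authors' implementation. It assumes a toy unicycle kinematics function (`rollout`, hypothetical) in place of the learned world model, a goal-distance cost in place of any image-based objective, and the cross-entropy method (CEM) as one possible iterative optimizer. The three constant-turn "anchors" are likewise an assumption, chosen only to show how several initializations can cover different modes of the action distribution.

```python
import numpy as np

def rollout(actions, dt=0.5):
    """Toy stand-in for a world model (assumption): unicycle kinematics.

    actions: array of shape (N, H, 2) holding (linear, angular) velocity.
    Returns the final (x, y) position of each sequence, shape (N, 2).
    """
    n = actions.shape[0]
    x, y, theta = np.zeros(n), np.zeros(n), np.zeros(n)
    for t in range(actions.shape[1]):
        v, w = actions[:, t, 0], actions[:, t, 1]
        theta = theta + w * dt
        x = x + v * np.cos(theta) * dt
        y = y + v * np.sin(theta) * dt
    return np.stack([x, y], axis=1)

def cost(actions, goal):
    """Distance from each predicted final position to the goal."""
    return np.linalg.norm(rollout(actions) - goal, axis=1)

def plan(goal, horizon=8, iters=10, samples=64, elites=8, seed=0):
    """Run one CEM refinement per anchor and return the best result."""
    rng = np.random.default_rng(seed)
    # Hypothetical anchors: constant left turn, straight, constant right turn.
    anchors = [np.tile([0.5, w], (horizon, 1)) for w in (0.4, 0.0, -0.4)]
    best_actions, best_cost = None, np.inf
    for mean in anchors:
        std = np.full_like(mean, 0.3)
        for _ in range(iters):
            # Sample candidates around the current mean and score them.
            cand = mean + std * rng.standard_normal((samples, horizon, 2))
            cand[0] = mean  # always keep the current mean as a candidate
            c = cost(cand, goal)
            # Refit the sampling distribution to the lowest-cost candidates.
            elite = cand[np.argsort(c)[:elites]]
            mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-3
        final = cost(mean[None], goal)[0]
        if final < best_cost:
            best_actions, best_cost = mean, final
    return best_actions, best_cost
```

Calling `plan(np.array([1.0, 1.0]))` returns an `(8, 2)` action sequence together with its predicted goal distance. Each optimizer iteration scores a whole batch of candidate sequences through the predictive model, which is why the model's per-call latency bounds the achievable control frequency.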