Unmanned aerial vehicles (UAVs) have emerged as powerful embodied agents. A core capability of such agents is autonomous navigation in large-scale three-dimensional environments. Existing navigation policies, however, are typically optimized for low-level objectives such as obstacle avoidance and trajectory smoothness, and lack the ability to incorporate high-level semantics into planning. To bridge this gap, we propose ANWM, an aerial navigation world model that predicts future visual observations conditioned on past frames and actions, thereby enabling agents to rank candidate trajectories by their semantic plausibility and navigational utility. ANWM is trained on 4-DoF UAV trajectories and introduces a physics-inspired module, Future Frame Projection (FFP), which projects past frames into future viewpoints to provide coarse geometric priors. This module mitigates representational uncertainty in long-distance visual generation and captures the mapping between 3D trajectories and egocentric observations. Empirical results demonstrate that ANWM significantly outperforms existing world models in long-distance visual forecasting and improves UAV navigation success rates in large-scale environments.
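The abstract does not spell out FFP's formulation, but the underlying geometric idea, reprojecting a past frame into a future viewpoint given per-pixel depth, camera intrinsics, and a relative 4-DoF pose (translation plus yaw), can be sketched as follows. All function names, the pose convention, and the availability of depth are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def pose_4dof(dx, dy, dz, yaw):
    """Homogeneous transform for a 4-DoF motion: translation plus yaw.
    Assumes a z-forward / y-down camera convention, so yaw rotates about y.
    This parameterization is an illustrative assumption."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
    T[:3, 3] = [dx, dy, dz]
    return T

def warp_to_future(image, depth, K, T_rel):
    """Forward-warp a grayscale `image` into the future viewpoint.

    `depth` is per-pixel depth in the current frame, `K` the 3x3 intrinsics,
    `T_rel` the 4x4 pose of the future camera expressed in the current frame.
    Returns the warped image and a validity mask; unobserved pixels (holes)
    stay zero, which is what makes the projection only a *coarse* prior.
    """
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)])  # 3 x N homogeneous pixels
    pts = np.linalg.inv(K) @ (pix * depth.ravel())          # back-project to 3D
    pts = np.vstack([pts, np.ones(H * W)])
    pts = (np.linalg.inv(T_rel) @ pts)[:3]                  # express in future camera frame
    ok = pts[2] > 1e-6                                      # keep points in front of the camera
    proj = K @ pts[:, ok]
    up = np.round(proj[0] / proj[2]).astype(int)            # reproject to future pixels
    vp = np.round(proj[1] / proj[2]).astype(int)
    inb = (up >= 0) & (up < W) & (vp >= 0) & (vp < H)
    warped = np.zeros((H, W), dtype=image.dtype)
    mask = np.zeros((H, W), dtype=bool)
    src = ok.nonzero()[0][inb]                              # flat source indices of surviving pixels
    # Naive scatter: last write wins. A real system would z-buffer occlusions.
    warped[vp[inb], up[inb]] = image.ravel()[src]
    mask[vp[inb], up[inb]] = True
    return warped, mask
```

With an identity pose the warp is a no-op, and larger motions leave holes and disocclusions in the mask; a learned generative model can then inpaint and refine such coarse projections rather than hallucinate distant views from scratch.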