AutoFly: Vision-Language-Action Model for UAV Autonomous Navigation in the Wild

Vision-language navigation (VLN) requires intelligent agents to navigate environments by interpreting linguistic instructions alongside visual observations, and it is a cornerstone task in Embodied AI. Current VLN research for unmanned aerial vehicles (UAVs) relies on detailed, pre-specified instructions that guide the UAV along predetermined routes. However, real-world outdoor exploration typically occurs in unknown environments where detailed navigation instructions are unavailable. Only coarse-grained positional or directional guidance can be provided, so the UAV must navigate autonomously through continuous planning and obstacle avoidance. To bridge this gap, we propose AutoFly, an end-to-end Vision-Language-Action (VLA) model for autonomous UAV navigation. AutoFly incorporates a pseudo-depth encoder that derives depth-aware features from RGB inputs to enhance spatial reasoning, coupled with a progressive two-stage training strategy that aligns visual, depth, and linguistic representations with action policies. Moreover, existing VLN datasets have two fundamental limitations for real-world autonomous navigation: they rely heavily on explicit instruction-following rather than autonomous decision-making, and they contain insufficient real-world data. To address these issues, we construct a novel autonomous navigation dataset that shifts the paradigm from instruction-following to autonomous behavior modeling through (1) trajectory collection emphasizing continuous obstacle avoidance, autonomous planning, and recognition workflows, and (2) comprehensive real-world data integration. Experimental results demonstrate that AutoFly achieves a 3.9% higher success rate than state-of-the-art VLA baselines, with consistent performance across simulated and real environments.
