Vision-language navigation (VLN) requires intelligent agents to navigate environments by interpreting linguistic instructions alongside visual observations, serving as a cornerstone task in Embodied AI. Current VLN research for unmanned aerial vehicles (UAVs) relies on detailed, pre-specified instructions to guide the UAV along predetermined routes. However, real-world outdoor exploration typically occurs in unknown environments where detailed navigation instructions are unavailable. Instead, only coarse-grained positional or directional guidance can be provided, requiring UAVs to navigate autonomously through continuous planning and obstacle avoidance. To bridge this gap, we propose AutoFly, an end-to-end Vision-Language-Action (VLA) model for autonomous UAV navigation. AutoFly incorporates a pseudo-depth encoder that derives depth-aware features from RGB inputs to enhance spatial reasoning, coupled with a progressive two-stage training strategy that aligns visual, depth, and linguistic representations with action policies. Moreover, existing VLN datasets have fundamental limitations for real-world autonomous navigation: they emphasize explicit instruction following over autonomous decision-making, and they contain insufficient real-world data. To address these issues, we construct a novel autonomous navigation dataset that shifts the paradigm from instruction following to autonomous behavior modeling through (1) trajectory collection emphasizing continuous obstacle avoidance, autonomous planning, and recognition workflows, and (2) comprehensive real-world data integration. Experimental results demonstrate that AutoFly achieves a 3.9% higher success rate than state-of-the-art VLA baselines, with consistent performance across simulated and real environments.