Aerial vision-language navigation (AVLN) enables UAVs to follow natural-language instructions in complex 3D environments. However, existing zero-shot AVLN methods often suffer from unstable decision-making by a single-stream Vision-Language Model (VLM), unreliable long-horizon progress monitoring, and a trade-off between safety and efficiency. We propose OnFly, a fully onboard, real-time framework for zero-shot AVLN. OnFly adopts a shared-perception dual-agent architecture that decouples high-frequency target generation from low-frequency progress monitoring, thereby stabilizing decision-making. It further employs a hybrid keyframe and recent-frame memory that preserves global trajectory context while maintaining KV-cache prefix stability, enabling reliable long-horizon monitoring with termination and recovery signals. In addition, a semantic-geometric verifier uses VLM features and depth cues to refine VLM-predicted targets for instruction consistency and geometric safety, while a receding-horizon planner generates optimized collision-free trajectories under geometric safety constraints, improving both safety and efficiency. In simulation, OnFly raises the task success rate from 26.4%, achieved by the strongest state-of-the-art baseline, to 67.8%, and fully onboard real-world flights validate its feasibility for real-time deployment. The code will be released at https://github.com/Robotics-STAR-Lab/OnFly
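To make the hybrid memory idea concrete, the following is a minimal sketch of how a keyframe plus recent-frame buffer could keep the leading part of the VLM context unchanged between queries, which is the property a KV cache needs in order to reuse its prefix. It is not taken from the OnFly code: the class `HybridFrameMemory`, the helper `shared_prefix_len`, the fixed-stride keyframe rule, and the parameters `recent_size` and `keyframe_stride` are all assumptions made for illustration, and the abstract does not say how keyframes are actually selected.

```python
from collections import deque


class HybridFrameMemory:
    """Hypothetical keyframe + recent-frame buffer (illustrative, not OnFly's code)."""

    def __init__(self, recent_size=3, keyframe_stride=10):
        self.keyframe_stride = keyframe_stride    # assumed rule: every N-th frame is a keyframe
        self.keyframes = []                       # append-only, so earlier entries never move
        self.recent = deque(maxlen=recent_size)   # sliding window over the newest frames
        self._count = 0

    def add(self, frame):
        """Store a new frame, promoting it to a keyframe on a fixed stride."""
        if self._count % self.keyframe_stride == 0:
            self.keyframes.append(frame)
        self.recent.append(frame)
        self._count += 1

    def context(self):
        """Keyframes first (stable prefix), then the recent window (changing suffix)."""
        return self.keyframes + list(self.recent)


def shared_prefix_len(a, b):
    """Number of leading items two contexts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


memory = HybridFrameMemory(recent_size=3, keyframe_stride=10)
for i in range(15):
    memory.add(f"f{i}")
earlier = memory.context()   # keyframes f0, f10 followed by f12, f13, f14

for i in range(15, 25):
    memory.add(f"f{i}")
later = memory.context()     # keyframes f0, f10, f20 followed by f22, f23, f24

# The two earlier keyframes still lead the later context, so their cached
# key/value entries could be reused rather than recomputed.
reusable = shared_prefix_len(earlier, later)
```

Placing the append-only keyframes before the sliding window is the design choice that matters here: if the recent frames came first, every new frame would change the start of the context and invalidate any cached prefix.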