Existing Unmanned Aerial Vehicle (UAV) Vision-Language Navigation (VLN) datasets depend on virtual environments, use instructions that lack naturalness, and are limited in scale. To address these challenges, we propose AirNav, a large-scale UAV VLN benchmark built from real urban aerial data rather than synthetic environments, with natural and diverse instructions. Additionally, we introduce AirVLN-R1, a model that combines Supervised Fine-Tuning and Reinforcement Fine-Tuning to improve performance and generalization. We preliminarily evaluate the model's feasibility through real-world tests. Our dataset and code are publicly available.