Vision-Language-Action (VLA) models offer a promising end-to-end paradigm for unmanned aerial vehicles (UAVs) to accomplish complex tasks specified by fine-grained instructions. However, standard supervised fine-tuning (SFT) suffers from data scarcity, limited generalization, and weak supervision for nuanced, complex human intents. Reinforcement fine-tuning offers a natural way to mitigate these limitations and to align policy behavior with human intents through designable feedback, but applying it to aerial navigation remains difficult because exploration is inefficient in expansive continuous spaces. To address this, we introduce an efficient reinforcement learning (RL) framework for VLA-based aerial navigation. At its core is EG-GRPO (Expert-Guided Group Relative Policy Optimization), which augments online rollouts with few-shot expert data. We also design a heterogeneous pipeline that runs simulation and inference in parallel, reducing rollout time by 43.5%. Across multiple tasks specified by complex human intents, EG-GRPO raises the success rate to 2.13x that of the SFT baseline and improves intent-alignment performance by 60.9%. These results demonstrate that our framework can move aerial navigation toward precise, intent-aligned flight.