Reinforcement learning (RL) has emerged as a powerful paradigm for achieving online agile navigation with quadrotors. Despite this success, policies trained via standard RL typically fail to generalize across significant dynamic variations, exhibiting a critical lack of adaptability. This work introduces MAVEN, a meta-RL framework that enables a single policy to achieve robust end-to-end navigation across a wide range of quadrotor dynamics. Our approach features a novel predictive context encoder, which learns to infer a latent representation of the system dynamics from interaction history. We demonstrate our method in agile waypoint traversal tasks under two challenging scenarios: large variations in quadrotor mass and severe single-rotor thrust loss. We leverage a GPU-vectorized simulator to distribute tasks across thousands of parallel environments, overcoming the long training times of meta-RL to converge in less than an hour. Through extensive experiments in both simulation and the real world, we validate that MAVEN achieves superior adaptation and agility. The policy successfully executes zero-shot sim-to-real transfer, demonstrating robust online adaptation by performing high-speed maneuvers despite mass variations of up to 66.7% and single-rotor thrust losses as severe as 70%.