Reinforcement learning (RL) has achieved outstanding success in complex robot-control tasks such as drone racing, where RL agents have outperformed human champions on a known racing track. However, these agents fail on unseen track configurations and require complete retraining for every new layout. This work aims to develop RL agents that generalize effectively to novel track configurations without retraining. The naive solution of training directly on a diverse set of track layouts can overburden the agent: the increased complexity of the environment impairs the agent's ability to learn to fly, resulting in a suboptimal policy. To improve the generalizability of the RL agent, we propose an adaptive environment-shaping framework that dynamically adjusts the training environment based on the agent's performance. We achieve this by leveraging a secondary RL policy to design environments that strike a balance between being challenging and achievable, allowing the agent to adapt and improve progressively. With our adaptive environment shaping, a single racing policy efficiently learns to race on diverse, challenging tracks. Experimental results in both simulation and the real world show that our method enables drones to successfully fly complex, unseen race tracks, outperforming existing environment-shaping techniques. Project page: http://rpg.ifi.uzh.ch/env_as_policy.
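To make the performance-driven shaping idea concrete, the following is a minimal toy sketch. It is not the paper's method: the paper learns the environment designer with a secondary RL policy, whereas this sketch substitutes a simple heuristic rule (raise track difficulty when the agent's success rate is above a target, lower it otherwise). All names (`update_difficulty`, `simulate_training`, the linear success model) are hypothetical illustrations.

```python
import random

def update_difficulty(difficulty, success_rate, target=0.6, step=0.05,
                      lo=0.0, hi=1.0):
    """Heuristic stand-in for a learned environment-shaping policy:
    increase difficulty when the agent succeeds too often, decrease it
    when it fails too often, keeping training challenging yet achievable."""
    if success_rate > target:
        difficulty += step
    else:
        difficulty -= step
    return min(hi, max(lo, difficulty))  # clamp to valid range

def simulate_training(n_rounds=200, episodes_per_round=20, seed=0):
    """Toy loop: a hypothetical agent whose per-episode success probability
    is (1 - difficulty). The shaper drives difficulty toward the point
    where the success rate sits near the target."""
    rng = random.Random(seed)
    difficulty = 0.1
    for _ in range(n_rounds):
        successes = sum(rng.random() < 1.0 - difficulty
                        for _ in range(episodes_per_round))
        difficulty = update_difficulty(difficulty,
                                       successes / episodes_per_round)
    return difficulty

final_difficulty = simulate_training()
```

With a target success rate of 0.6 and the toy agent model above, the difficulty settles near 0.4, where the agent succeeds at roughly the target rate; in the paper this balancing role is played by a learned policy over full track configurations rather than a scalar heuristic.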