Efficient Reinforcement Learning for Autonomous Driving with Parameterized Skills and Priors

Autonomous vehicles deployed on public roads will encounter countless and diverse driving situations, and many manually designed driving policies are difficult to scale to the real world. Reinforcement learning (RL) has shown great success in many tasks through automatic trial and error. However, for autonomous driving in interactive dense traffic, RL agents either fail to reach reasonable performance or require a large amount of data. Our insight is that when humans learn to drive, they 1) make decisions over a high-level skill space instead of the low-level control space and 2) leverage expert prior knowledge rather than learning from scratch. Inspired by this, we propose ASAP-RL, an efficient reinforcement learning algorithm for autonomous driving that simultaneously leverages motion skills and expert priors. We first parameterize motion skills so that they are diverse enough to cover various complex driving scenarios and situations. We then propose a skill parameter inverse recovery method to convert expert demonstrations from control space to skill space. Finally, we propose a simple but effective double initialization technique that leverages expert priors while bypassing the issues of expert suboptimality and early performance degradation. We validate the proposed method on interactive dense-traffic driving tasks given simple and sparse rewards. Experimental results show that our method achieves higher learning efficiency and better driving performance than previous methods that exploit skills and priors differently. Code is open-sourced to facilitate further research.
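
To make the idea of converting demonstrations from control space to skill space more concrete, the following is a minimal sketch, not the ASAP-RL implementation. It assumes a hypothetical two-parameter skill (a lateral offset and an end speed) decoded into a short waypoint sequence, and it recovers those parameters from an expert segment by gradient descent on the waypoint error. The names decode_skill and recover_skill_params, the constants HORIZON and DT, and the skill parameterization itself are illustrative assumptions; the abstract does not specify them.

```python
# Illustrative sketch only: the skill definition and optimizer are assumptions,
# not the parameterization or recovery procedure used by ASAP-RL.

HORIZON = 10  # waypoints per skill (assumed)
DT = 0.1      # seconds between waypoints (assumed)


def decode_skill(lateral_offset, end_speed, v0):
    """Hypothetical skill decoder: map skill parameters to (x, y) waypoints."""
    waypoints = []
    x = 0.0
    for k in range(1, HORIZON + 1):
        s = k / HORIZON                             # normalized time in (0, 1]
        speed = v0 + (end_speed - v0) * s           # linear speed profile
        x += speed * DT                             # integrate longitudinal position
        y = lateral_offset * (3 * s**2 - 2 * s**3)  # smooth lateral shift
        waypoints.append((x, y))
    return waypoints


def trajectory_error(a, b):
    """Mean squared distance between two equally long waypoint lists."""
    return sum((ax - bx) ** 2 + (ay - by) ** 2
               for (ax, ay), (bx, by) in zip(a, b)) / len(a)


def recover_skill_params(expert_xy, v0, steps=300, lr=1.0, eps=1e-4):
    """Fit skill parameters so the decoded skill matches an expert segment.

    Uses gradient descent with central finite differences, so it works with
    any decoder that can be evaluated, differentiable in code or not.
    """
    params = [0.0, v0]  # start from "keep lane, keep speed"
    for _ in range(steps):
        grad = []
        for i in range(len(params)):
            plus, minus = list(params), list(params)
            plus[i] += eps
            minus[i] -= eps
            err_plus = trajectory_error(decode_skill(*plus, v0), expert_xy)
            err_minus = trajectory_error(decode_skill(*minus, v0), expert_xy)
            grad.append((err_plus - err_minus) / (2 * eps))
        params = [p - lr * g for p, g in zip(params, grad)]
    return tuple(params)


if __name__ == "__main__":
    # Stand-in for an expert segment: generated from known skill parameters.
    expert_segment = decode_skill(1.5, 8.0, v0=5.0)
    print(recover_skill_params(expert_segment, v0=5.0))
```

In this toy setting the expert segment is itself generated by the decoder, so the recovered parameters should closely match the ones used to create it. Real demonstrations would only be approximated by the nearest skill, and the residual error would indicate how well the chosen skill space covers the expert's behavior.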
