Recent advances in high-fidelity simulators have enabled closed-loop training of autonomous driving agents, potentially addressing the distribution shift between training and deployment and allowing training to be scaled both safely and cheaply. However, how to build effective benchmarks for closed-loop training remains poorly understood. In this work, we present the first empirical study that analyzes how different training benchmark designs affect the success of learning agents, including how to design traffic scenarios and how to scale training environments. Furthermore, we show that many popular RL algorithms cannot achieve satisfactory performance in the context of autonomous driving, as they lack long-term planning and take an extremely long time to train. To address these issues, we propose trajectory value learning (TRAVL), an RL-based driving agent that performs planning with multistep look-ahead and exploits cheaply generated imagined data for efficient learning. Our experiments show that TRAVL learns much faster and produces safer maneuvers than all the baselines. For more information, visit the project website: https://waabi.ai/research/travl
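To make "planning with multistep look-ahead" concrete, the following is a minimal sketch of the general idea only: roll out several candidate trajectories over a short horizon, score each whole trajectory with a value function, and execute the best one. It is not the TRAVL implementation. The toy state, the constant-acceleration candidates, and the names `rollout`, `toy_value`, and `plan` are all assumptions made for illustration, and the hand-written `toy_value` merely stands in for a value function that would be learned. Imagined data and the learning procedure itself are not covered.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical toy state: (position along the lane in m, speed in m/s).
State = Tuple[float, float]
Trajectory = List[State]


def rollout(state: State, accel: float, horizon: int, dt: float = 0.5) -> Trajectory:
    """Roll a constant-acceleration candidate forward for `horizon` steps."""
    pos, speed = state
    traj: Trajectory = []
    for _ in range(horizon):
        speed = max(0.0, speed + accel * dt)  # the vehicle cannot reverse
        pos += speed * dt
        traj.append((pos, speed))
    return traj


def toy_value(traj: Trajectory, obstacle_pos: float, target_speed: float) -> float:
    """Hand-written stand-in for a learned trajectory value (an assumption).

    Penalises deviation from the target speed at every step and applies a
    large penalty for each step at or beyond the obstacle position.
    """
    value = 0.0
    for pos, speed in traj:
        value -= abs(speed - target_speed)
        if pos >= obstacle_pos:
            value -= 1000.0
    return value


def plan(
    state: State,
    accels: Sequence[float],
    horizon: int,
    value_fn: Callable[[Trajectory], float],
) -> Tuple[float, List[float]]:
    """Score every candidate trajectory and return the best acceleration."""
    candidates = [rollout(state, a, horizon) for a in accels]
    scores = [value_fn(t) for t in candidates]
    best = max(range(len(accels)), key=scores.__getitem__)
    return accels[best], scores


if __name__ == "__main__":
    start = (0.0, 10.0)
    accel, scores = plan(
        start, [-4.0, 0.0, 2.0], horizon=6,
        value_fn=lambda t: toy_value(t, obstacle_pos=20.0, target_speed=10.0),
    )
    print("chosen acceleration:", accel)
    print("candidate scores:", scores)
```

Because the whole multi-step trajectory is scored rather than a single action, the braking candidate wins in the usage example even though it is the worst choice for speed tracking: the other candidates reach the obstacle within the look-ahead horizon.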