Reinforcement learning provides an appealing framework for robotic control due to its ability to learn expressive policies purely through real-world interaction. However, this requires addressing real-world constraints and avoiding catastrophic failures during training, which might severely impede both learning progress and the performance of the final policy. In many robotics settings, this amounts to avoiding certain "unsafe" states. The high-speed off-road driving task represents a particularly challenging instantiation of this problem: a high-return policy should drive as aggressively and as quickly as possible, which often requires getting close to the edge of the set of "safe" states, and therefore places a particular burden on the method to avoid frequent failures. To both learn highly performant policies and avoid excessive failures, we propose a reinforcement learning framework that combines risk-sensitive control with an adaptive action space curriculum. Furthermore, we show that our risk-sensitive objective automatically avoids out-of-distribution states when equipped with an estimator for epistemic uncertainty. We implement our algorithm on a small-scale rally car and show that it is capable of learning high-speed policies for a real-world off-road driving task. We show that our method greatly reduces the number of safety violations during the training process, and actually leads to higher-performance policies in both driving and non-driving simulation environments with similar challenges.
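To make the two mechanisms named above concrete, here is a minimal illustrative sketch (not the paper's implementation): a risk-sensitive action score computed as a pessimistic lower bound over an ensemble of critics, where ensemble disagreement stands in for epistemic uncertainty, together with a simple action-space curriculum that widens the allowed control range only while the recent failure rate stays low. All names and thresholds (`EnsembleCritic`, `beta`, `ActionCurriculum`, the window and tolerance) are hypothetical choices made for this example.

```python
# Hedged sketch under the assumptions stated above; not the authors' code.
import numpy as np


class EnsembleCritic:
    """Toy ensemble of linear critics; the spread across members serves as
    an epistemic-uncertainty proxy (out-of-distribution inputs tend to
    produce larger disagreement)."""

    def __init__(self, n_members: int, obs_dim: int, act_dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.weights = rng.normal(size=(n_members, obs_dim + act_dim))

    def values(self, obs: np.ndarray, act: np.ndarray) -> np.ndarray:
        x = np.concatenate([obs, act])
        return self.weights @ x  # one value estimate per ensemble member


def risk_sensitive_score(critic: EnsembleCritic, obs, act, beta: float = 1.0) -> float:
    """Lower-confidence-bound objective: mean minus beta * std.
    Penalizing disagreement discourages actions whose value is uncertain,
    i.e. actions that lead toward out-of-distribution states."""
    q = critic.values(obs, act)
    return float(q.mean() - beta * q.std())


class ActionCurriculum:
    """Adaptive action-space curriculum: start with a conservative throttle
    limit and expand it only while failures over a recent window are rare."""

    def __init__(self, limit: float = 0.3, max_limit: float = 1.0):
        self.limit, self.max_limit = limit, max_limit
        self.recent_failures: list[int] = []

    def update(self, failed: bool, window: int = 50, tol: float = 0.05):
        self.recent_failures.append(int(failed))
        self.recent_failures = self.recent_failures[-window:]
        if len(self.recent_failures) == window and np.mean(self.recent_failures) < tol:
            self.limit = min(self.max_limit, self.limit + 0.1)

    def clip(self, act: np.ndarray) -> np.ndarray:
        return np.clip(act, -self.limit, self.limit)
```

In this sketch, increasing `beta` makes action selection more conservative near states the ensemble has rarely seen, while the curriculum only grants the policy a wider action range once it has demonstrated it can avoid failures at the current limit.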