Reinforcement learning (RL) has been successfully applied to a variety of robotics applications, where it outperforms classical methods. However, the safety of RL and its transfer to the real world remain open challenges. Safe reinforcement learning is a prominent field that tackles these challenges by ensuring the safety of agents during both training and execution. Safe RL can be achieved through constrained RL or through safe exploration. Constrained RL learns the safety constraints over the course of training and reaches safe behavior by the end of training, at the cost of a high number of collisions in the early stages. Safe exploration offers robust safety by enforcing the safety constraints as hard constraints, which prevents collisions but hinders the exploration of the RL agent, resulting in lower rewards and poor performance. To overcome these drawbacks, we propose a novel safety shield that combines the robustness of optimization-based controllers with the long-horizon prediction capabilities of RL agents, allowing the RL agent to adaptively tune the parameters of the controller. Our approach improves the exploration of RL agents in navigation tasks while minimizing the number of collisions. Experiments in simulation show that our approach outperforms state-of-the-art baselines in the ratio of reached goals to collisions across several challenging environments. This goals-to-collisions metric emphasizes the importance of minimizing the number of collisions while learning to accomplish the task. Our approach reaches more goals than classic safety shields and incurs fewer collisions than constrained RL approaches. Finally, we demonstrate the performance of the proposed method in a real-world experiment.
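The idea of an RL agent tuning the parameters of an optimization-based safety shield can be illustrated with a minimal sketch. The abstract does not specify the controller, the robot model, or which parameters are tuned, so everything below is an assumption made for illustration: a single-integrator point robot, one circular obstacle, a control-barrier-style constraint, and a hypothetical function `shield_action` whose gain `alpha` stands in for the parameter the agent would choose. With only one linear constraint, the projection of the nominal action onto the safe half-space has a closed form, so no solver is needed.

```python
from typing import List, Sequence


def shield_action(
    x: Sequence[float],
    u_nom: Sequence[float],
    p_obs: Sequence[float],
    radius: float,
    alpha: float,
) -> List[float]:
    """Minimally modify u_nom so that grad_h . u >= -alpha * h holds.

    Assumed model (not from the paper): x_dot = u, one circular obstacle.
    alpha is the tunable shield parameter: a small alpha makes the shield
    conservative, a large alpha lets the robot approach the obstacle faster.
    """
    d = [xi - pi for xi, pi in zip(x, p_obs)]
    h = sum(di * di for di in d) - radius ** 2   # h >= 0 outside the obstacle
    a = [2.0 * di for di in d]                   # gradient of h with respect to x
    b = -alpha * h                               # safe set of actions: a . u >= b
    violation = b - sum(ai * ui for ai, ui in zip(a, u_nom))
    a_sq = sum(ai * ai for ai in a)
    if violation <= 0.0 or a_sq == 0.0:
        # Nominal action already satisfies the constraint (or the gradient
        # is undefined at the obstacle center), so it passes through unchanged.
        return list(u_nom)
    # Euclidean projection of u_nom onto the half-space a . u >= b.
    scale = violation / a_sq
    return [ui + scale * ai for ui, ai in zip(u_nom, a)]


if __name__ == "__main__":
    state, nominal, obstacle = [-2.0, 0.0], [1.0, 0.0], [0.0, 0.0]
    # The same nominal action under two different shield gains.
    print(shield_action(state, nominal, obstacle, radius=1.0, alpha=0.5))
    print(shield_action(state, nominal, obstacle, radius=1.0, alpha=2.0))
```

In this toy setting, the conservative gain `alpha=0.5` slows the approach toward the obstacle, while `alpha=2.0` leaves the nominal action untouched. Letting a learned policy output `alpha` per step is one simple way to trade exploration against conservatism; the shield in the paper may be parameterized quite differently.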