Deploying reinforcement learning (RL) agents in the real world is challenging because learning through trial and error carries risk. We propose a task-agnostic method that leverages small sets of safe and unsafe demonstrations to improve the safety of RL agents during learning. At every step, the method compares the agent's current trajectory with both sets of demonstrations and filters the trajectory if it resembles the unsafe demonstrations. We perform ablation studies on different filtering strategies and investigate how the number of demonstrations affects performance. Our method is compatible with any stand-alone RL algorithm and can be applied to any task. We evaluate it on three tasks from OpenAI Gym's MuJoCo benchmark, using two state-of-the-art RL algorithms. The results demonstrate that our method significantly reduces the agent's crash rate while converging to, and in most cases improving on, the performance of the stand-alone agent.
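The per-step comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the similarity measure or the filtering rule, so everything here is an assumption made for illustration: the names `window_distance`, `min_distance`, and `resembles_unsafe` are hypothetical, trajectories are treated as short windows of state vectors, and a window is flagged when its nearest unsafe demonstration segment is closer (in Euclidean distance) than its nearest safe one. What happens to a flagged trajectory is left to the caller.

```python
import math

def window_distance(a, b):
    """Euclidean distance between two equal-length sequences of state vectors."""
    return math.sqrt(sum((x - y) ** 2
                         for sa, sb in zip(a, b)
                         for x, y in zip(sa, sb)))

def min_distance(window, demos):
    """Smallest distance between `window` and any same-length segment
    of any demonstration in `demos` (assumed nearest-segment matching)."""
    k = len(window)
    best = math.inf
    for demo in demos:
        for start in range(len(demo) - k + 1):
            best = min(best, window_distance(window, demo[start:start + k]))
    return best

def resembles_unsafe(window, safe_demos, unsafe_demos):
    """Hypothetical rule: flag the window when it lies closer to the
    unsafe demonstrations than to the safe ones."""
    return min_distance(window, unsafe_demos) < min_distance(window, safe_demos)

# Toy one-dimensional demonstrations (made-up values, for illustration only).
safe_demos = [[[0.0], [0.1], [0.2], [0.3]]]
unsafe_demos = [[[0.0], [1.0], [2.0], [3.0]]]

near_unsafe = [[1.1], [2.1]]   # tracks the unsafe demonstration
near_safe = [[0.1], [0.25]]    # tracks the safe demonstration

flag_unsafe = resembles_unsafe(near_unsafe, safe_demos, unsafe_demos)
flag_safe = resembles_unsafe(near_safe, safe_demos, unsafe_demos)
```

In a training loop, a check of this kind would run on the most recent states after each environment step. Brute-force matching over every segment is only practical because the demonstration sets are small.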