Safe exploration is a key requirement for reinforcement learning (RL) agents to learn and adapt online, beyond controlled (e.g., simulated) environments. In this work, we tackle this challenge by using suboptimal yet conservative policies (e.g., obtained from offline data or simulators) as priors. Our approach, SOOPER, uses probabilistic dynamics models to explore optimistically, yet pessimistically falls back to the conservative policy prior when needed. We prove that SOOPER guarantees safety throughout learning and establish convergence to an optimal policy by bounding its cumulative regret. Extensive experiments on key safe RL benchmarks and real-world hardware demonstrate that SOOPER is scalable and outperforms the state of the art, and they validate our theoretical guarantees in practice.
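To make the "explore optimistically, fall back pessimistically" idea concrete, here is a minimal toy sketch. It is not SOOPER's actual algorithm: it assumes a 1-D state, a small ensemble of linear models standing in for a probabilistic dynamics model, and a simple one-step worst-case safety check. All names (`make_ensemble`, `select_action`, `rollout`), the gains, and the safety limit are hypothetical choices made for illustration only.

```python
from typing import Callable, List, Sequence, Tuple

# (state, action) -> predicted next state
Model = Callable[[float, float], float]


def make_ensemble(gains: Sequence[float]) -> List[Model]:
    """Hypothetical stand-in for a probabilistic dynamics model:
    one linear model x' = x + g * a per plausible gain g."""
    return [lambda x, a, g=g: x + g * a for g in gains]


def pessimistic_is_safe(x: float, a: float, models: List[Model], limit: float) -> bool:
    """Pessimistic check: the action counts as safe only if every
    plausible model keeps the next state inside [-limit, limit]."""
    return all(abs(m(x, a)) <= limit for m in models)


def select_action(
    x: float,
    explore_policy: Callable[[float], float],
    prior_policy: Callable[[float], float],
    models: List[Model],
    limit: float,
) -> Tuple[float, bool]:
    """Return (action, used_fallback). Take the exploratory action when it
    passes the pessimistic check, else use the conservative prior."""
    a = explore_policy(x)
    if pessimistic_is_safe(x, a, models, limit):
        return a, False
    return prior_policy(x), True


def rollout(true_gain: float, steps: int, limit: float = 1.0) -> Tuple[List[float], int]:
    """Simulate the toy system and count how often the prior takes over."""
    models = make_ensemble([0.8, 1.0, 1.2])  # assumed model uncertainty
    explore = lambda x: 0.3                  # assumed: always push toward the boundary
    prior = lambda x: -0.5 * x               # assumed: conservative pull toward the origin
    x, states, fallbacks = 0.0, [0.0], 0
    for _ in range(steps):
        a, used_fallback = select_action(x, explore, prior, models, limit)
        fallbacks += int(used_fallback)
        x = x + true_gain * a                # true dynamics, unknown to the agent
        states.append(x)
    return states, fallbacks


if __name__ == "__main__":
    states, fallbacks = rollout(true_gain=1.0, steps=20)
    print(f"max |state| = {max(abs(s) for s in states):.3f}, fallbacks = {fallbacks}")
```

In this toy setting the state stays within the limit as long as the true gain lies inside the range covered by the ensemble. It is a deliberately simplified analogue of the safety-throughout-learning property described above, not a reproduction of the paper's guarantee.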