Safe exploration is a key requirement for reinforcement learning (RL) agents to learn and adapt online, beyond controlled (e.g., simulated) environments. In this work, we tackle this challenge by utilizing suboptimal yet conservative policies (e.g., obtained from offline data or simulators) as priors. Our approach, SOOPER, uses probabilistic dynamics models to explore optimistically, yet pessimistically fall back to the conservative policy prior when needed. We prove that SOOPER guarantees safety throughout learning, and establish convergence to an optimal policy by bounding its cumulative regret. Extensive experiments on key safe RL benchmarks and real-world hardware demonstrate that SOOPER is scalable and outperforms the state of the art, and validate our theoretical guarantees in practice.
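The optimistic-exploration/pessimistic-fallback mechanism described above can be sketched in miniature. This is an illustrative toy, not the paper's algorithm: the function names (`rollout_cost`, `sooper_action`), the ensemble-as-list representation, and the worst-case cost rule are all assumptions made for exposition.

```python
def rollout_cost(model, state, policy, horizon):
    """Accumulate predicted safety cost along a model rollout.
    `model(s, a)` is assumed to return (next_state, step_cost)."""
    total, s = 0.0, state
    for _ in range(horizon):
        a = policy(s)
        s, cost = model(s, a)
        total += cost
    return total

def sooper_action(state, optimistic_policy, prior_policy,
                  model_ensemble, cost_limit, horizon=10):
    """Illustrative SOOPER-style action selection: act optimistically
    unless a pessimistic (worst-case over the model ensemble) cost
    estimate exceeds the safety budget, in which case fall back to
    the conservative prior policy."""
    worst_cost = max(rollout_cost(m, state, optimistic_policy, horizon)
                     for m in model_ensemble)
    if worst_cost > cost_limit:
        return prior_policy(state)       # pessimistic fallback keeps safety
    return optimistic_policy(state)      # optimistic exploration
```

The ensemble of models plays the role of the probabilistic dynamics model: disagreement among its members captures epistemic uncertainty, and taking the worst case over members is the pessimism that triggers the fallback.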