Safe exploration is a key requirement for reinforcement learning (RL) agents to learn and adapt online, beyond controlled (e.g., simulated) environments. In this work, we tackle this challenge by utilizing suboptimal yet conservative policies (e.g., obtained from offline data or simulators) as priors. Our approach, SOOPER, uses probabilistic dynamics models to explore optimistically, yet falls back pessimistically to the conservative policy prior when needed. We prove that SOOPER guarantees safety throughout learning, and establish convergence to an optimal policy by bounding its cumulative regret. Extensive experiments on key safe RL benchmarks and real-world hardware demonstrate that SOOPER is scalable and outperforms the state of the art, and validate our theoretical guarantees in practice.
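To make the optimism/pessimism switching rule concrete, the following is a minimal sketch of the idea described above, not the paper's actual algorithm: an ensemble stands in for the probabilistic dynamics model, and all names (`EnsembleDynamics`, `budget`, `beta`) are illustrative assumptions. The agent proposes an optimistic action from its learned policy, evaluates a pessimistic (worst-case within the model's confidence set) safety estimate, and reverts to the conservative prior whenever that estimate exceeds the safety budget.

```python
import numpy as np

class EnsembleDynamics:
    """Toy probabilistic dynamics model: an ensemble of random linear models.

    The spread across ensemble members serves as the epistemic uncertainty
    that drives both optimism (exploration) and pessimism (safety checks).
    """
    def __init__(self, n_models=5, state_dim=2, action_dim=1, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(0.0, 0.1, (n_models, state_dim, state_dim))
        self.B = rng.normal(0.0, 0.1, (n_models, state_dim, action_dim))

    def predict(self, s, a):
        # Per-member next-state predictions; return their mean and spread.
        preds = (np.einsum('mij,j->mi', self.A, s)
                 + np.einsum('mik,k->mi', self.B, a))
        return preds.mean(axis=0), preds.std(axis=0)

def act(s, model, learned_policy, prior_policy, cost_fn, budget, beta=2.0):
    """Optimistic exploration with a pessimistic fallback to the prior.

    `beta` scales the confidence interval; `cost_fn` maps a predicted
    next state to a scalar safety cost (both are assumed interfaces).
    """
    a = learned_policy(s)                  # optimistic candidate action
    mu, sigma = model.predict(s, a)
    # Pessimistic safety estimate: worst case over the confidence set.
    pessimistic_cost = max(cost_fn(mu + beta * sigma),
                           cost_fn(mu - beta * sigma))
    if pessimistic_cost > budget:
        return prior_policy(s)             # fall back to the safe prior
    return a                               # otherwise, explore
```

In this sketch, shrinking ensemble disagreement (`sigma`) over the course of learning makes the pessimistic estimate less conservative, so the agent relies on the prior less as the model improves; this mirrors the abstract's claim of safety throughout learning together with eventual convergence.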