Deploying a Reinforcement Learning (RL) agent to a real-world system requires guarantees on the learning process itself. While a learning algorithm will eventually converge to a good policy, there are no guarantees on the performance of the exploratory policies it executes along the way. We study the problem of conservative exploration, where the learner must guarantee that its performance is at least as good as that of a baseline policy. We propose the first conservative, provably efficient, model-free algorithm for policy optimization in continuous finite-horizon problems. We leverage importance sampling techniques to counterfactually evaluate the conservative condition from the data generated by the algorithm itself. We derive a regret bound and show that, with high probability, the conservative constraint is never violated during learning. Finally, we leverage these insights to build a general schema for conservative exploration in deep RL via off-policy policy evaluation techniques. We show empirically the effectiveness of our methods.
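To make the idea of counterfactually evaluating a conservative condition concrete, the following is a minimal sketch, not the algorithm proposed in the paper. It assumes ordinary trajectory-wise importance sampling, a single-step check of the form "estimate minus slack is at least (1 - alpha) times the baseline value", and a known baseline value. The names `is_return_estimate`, `is_conservative`, `alpha`, and `slack` are hypothetical and chosen only for illustration.

```python
import math


def is_return_estimate(trajectories, target_logp, behavior_logp):
    """Ordinary importance sampling estimate of the target policy's expected return.

    trajectories: list of episodes, each a list of (state, action, reward) tuples
        collected while acting with the behavior policy.
    target_logp, behavior_logp: callables (state, action) -> log-probability
        (or log-density for continuous actions). Both are assumptions of this sketch.
    """
    total = 0.0
    for episode in trajectories:
        # Log of the trajectory importance weight: sum of per-step log-ratios.
        log_w = sum(target_logp(s, a) - behavior_logp(s, a) for s, a, _ in episode)
        episode_return = sum(r for _, _, r in episode)
        total += math.exp(log_w) * episode_return
    return total / len(trajectories)


def is_conservative(candidate_estimate, baseline_value, alpha, slack=0.0):
    """Return True if the pessimistic estimate meets the (assumed) conservative condition.

    slack stands in for a confidence width subtracted from the estimate;
    alpha in [0, 1] is the tolerated fraction of performance loss.
    """
    return candidate_estimate - slack >= (1.0 - alpha) * baseline_value


# Toy one-step problem: the behavior policy is uniform over two actions,
# the candidate policy picks action 1 with probability 0.8.
def behavior_logp(state, action):
    return math.log(0.5)


def target_logp(state, action):
    return math.log(0.8 if action == 1 else 0.2)


# Action 1 yields reward 1, action 0 yields reward 0.
data = [[(0, 0, 0.0)], [(0, 1, 1.0)], [(0, 1, 1.0)], [(0, 0, 0.0)]]

estimate = is_return_estimate(data, target_logp, behavior_logp)
safe_to_play = is_conservative(estimate, baseline_value=0.5, alpha=0.1, slack=0.1)
print(estimate, safe_to_play)
```

In this sketch the candidate policy would be played only when `safe_to_play` is true, and the baseline policy would be played otherwise. A larger `slack` makes the check more pessimistic, which is the role a confidence bound would play in a high-probability guarantee.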