This paper addresses the problem of maintaining safety during training in Reinforcement Learning (RL), such that safety constraint violations remain bounded at every point during learning. In many RL applications the safety of the agent is particularly important, e.g., autonomous platforms or robots that work in proximity to humans. Because enforcing safety during training can severely limit the agent's exploration, we propose a new architecture that handles the trade-off between efficient progress and safety during exploration. As exploration progresses, we use Bayesian inference to update Dirichlet-Categorical models of the transition probabilities of the Markov decision process that describes the environment dynamics. We then propose a way to approximate the moments of the belief about the risk associated with the action selection policy. We construct these approximations and prove convergence results. We further propose a novel method that leverages the expectation approximations to derive an approximate bound on the confidence that the risk is below a certain level. This approach can be easily interleaved with RL, and we present experimental results to showcase the performance of the overall architecture.
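As a rough illustration of the two ingredients named above, the following Python sketch keeps Dirichlet counts over the transitions of a small MDP, updates them from observed transitions, and turns a mean and variance of a risk belief into a lower bound on the confidence that the risk is below a given level. It is a minimal sketch under stated assumptions, not the method of the paper: the function names and array layout are hypothetical, the "risk" is simplified to the one-step probability of entering a single unsafe state, and the bound uses Cantelli's inequality as one standard way to obtain a one-sided bound from two moments.

```python
import numpy as np


def update_counts(alpha, s, a, s_next):
    """Dirichlet-Categorical posterior update: add one to the observed count.

    alpha has shape (n_states, n_actions, n_states); alpha[s, a] holds the
    Dirichlet parameters of the belief over next states (hypothetical layout).
    """
    alpha = alpha.copy()
    alpha[s, a, s_next] += 1.0
    return alpha


def posterior_mean(alpha):
    """Mean of each Dirichlet belief, i.e. the expected transition probabilities."""
    return alpha / alpha.sum(axis=-1, keepdims=True)


def posterior_variance(alpha):
    """Marginal variance of each transition probability under the Dirichlet belief."""
    total = alpha.sum(axis=-1, keepdims=True)
    mean = alpha / total
    return mean * (1.0 - mean) / (total + 1.0)


def cantelli_confidence(mean, var, level):
    """Lower bound on P(risk <= level) from the mean and variance of the risk.

    Uses Cantelli's inequality (an assumption made for this sketch). The bound
    is vacuous (0.0) when the mean is not below the level.
    """
    gap = level - mean
    if gap <= 0.0:
        return 0.0
    return gap ** 2 / (var + gap ** 2)


# Two states, one action, uniform prior; state 0 is treated as unsafe.
alpha = np.ones((2, 1, 2))
for _ in range(3):
    alpha = update_counts(alpha, s=0, a=0, s_next=1)

risk_mean = posterior_mean(alpha)[0, 0, 0]
risk_var = posterior_variance(alpha)[0, 0, 0]
confidence = cantelli_confidence(risk_mean, risk_var, level=0.5)
```

In this toy setting the bound tightens as more safe transitions are observed, because the posterior variance shrinks with the total count. A multi-step notion of risk would require propagating the belief through the policy, which this sketch does not attempt.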