We consider the problem of learning the optimal policy for Markov decision processes with safety constraints. We formulate the problem in a reach-avoid setup. Our goal is to design online reinforcement learning algorithms that satisfy the safety constraints with arbitrarily high probability during the learning phase. To this end, we first propose an algorithm based on the optimism in the face of uncertainty (OFU) principle. Building on this first algorithm, we propose our main algorithm, which incorporates entropy regularization. We provide a finite-sample analysis of both algorithms and derive their regret bounds. We demonstrate that entropy regularization improves the regret and drastically reduces the episode-to-episode variability inherent in OFU-based safe RL algorithms.