We consider the problem of learning the optimal policy for Markov decision processes with safety constraints. We formulate the problem in a reach-avoid setup. Our goal is to design online reinforcement learning algorithms that satisfy the safety constraints with arbitrarily high probability during the learning phase. To this end, we first propose an algorithm based on the optimism in the face of uncertainty (OFU) principle. Building on this first algorithm, we propose our main algorithm, which incorporates entropy regularization. We provide a finite-sample analysis of both algorithms and derive their regret bounds. We demonstrate that entropy regularization improves the regret bound and drastically reduces the episode-to-episode variability inherent in OFU-based safe RL algorithms.
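As a minimal illustration of the kind of entropy regularization referred to above (not the paper's algorithm): replacing the hard max in the Bellman backup with a temperature-scaled log-sum-exp yields the soft Bellman operator, which smooths the greedy policy and damps abrupt policy switches. The 2-state, 2-action MDP below is purely hypothetical, chosen only to make the sketch runnable.

```python
import numpy as np

def soft_backup(Q, P, R, gamma, tau):
    """One entropy-regularized (soft) Bellman backup.

    Q: (S, A) action-value table
    P: (S, A, S) transition kernel, R: (S, A) rewards
    tau: temperature; as tau -> 0 this recovers the hard max backup.
    """
    # soft state value: tau * log sum_a exp(Q[s, a] / tau)
    V = tau * np.log(np.exp(Q / tau).sum(axis=1))
    # standard backup using the soft value in place of max_a Q
    return R + gamma * np.einsum('sat,t->sa', P, V)

# hypothetical MDP (illustrative numbers only)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma, tau = 0.9, 0.5

# the soft operator is a gamma-contraction, so iteration converges
Q = np.zeros((2, 2))
for _ in range(500):
    Q = soft_backup(Q, P, R, gamma, tau)
```

Because log-sum-exp upper-bounds the max, the soft fixed point dominates the unregularized one by at most `tau * log(A)` per step, which is the usual price paid for the smoother, lower-variance policies the abstract alludes to.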