We consider the reinforcement learning problem for the constrained Markov decision process (CMDP), which plays a central role in satisfying safety or resource constraints in sequential learning and decision-making. In this problem, we are given finite resources and an MDP with unknown transition probabilities. At each stage, we take an action, collecting a reward and consuming some resources, all of which are assumed to be unknown and must be learned over time. In this work, we take the first step towards deriving optimal problem-dependent guarantees for CMDP problems. We derive a logarithmic regret bound, which translates into an $O(\frac{\kappa}{\epsilon}\cdot\log^2(1/\epsilon))$ sample complexity bound, where $\kappa$ is a problem-dependent parameter that is independent of $\epsilon$. In terms of the dependency on $\epsilon$, our bound improves upon the state-of-the-art $O(1/\epsilon^2)$ sample complexity for CMDP problems established in the previous literature. To achieve this advance, we develop a new framework for analyzing CMDP problems. Specifically, our algorithm operates in the primal space: at each period, we re-solve the primal LP for the CMDP problem in an online manner, with \textit{adaptive} remaining resource capacities. The key elements of our algorithm are: i) an elimination procedure that characterizes one optimal basis of the primal LP; and ii) a resolving procedure that is adaptive to the remaining resources and sticks to the characterized optimal basis.
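For orientation, the primal LP mentioned above can be sketched in the standard occupancy-measure form. This is a minimal sketch under assumptions the abstract does not state: a discounted infinite-horizon setting with discount factor $\gamma$, an initial state distribution $\mu_1$, $K$ resources with consumption functions $c_k$ and capacities $\alpha_k$, and an occupancy measure $q(s,a)$. All of these symbols are illustrative and may differ from the paper's own notation and setting.

```latex
% Occupancy-measure LP for a discounted CMDP (illustrative notation, not taken from the paper).
% q(s,a): normalized discounted frequency of visiting state s and taking action a.
\begin{align*}
\max_{q \ge 0} \quad & \sum_{s,a} r(s,a)\, q(s,a) \\
\text{s.t.} \quad & \sum_{s,a} c_k(s,a)\, q(s,a) \le \alpha_k, && k = 1, \dots, K, \\
& \sum_{a} q(s',a) - \gamma \sum_{s,a} P(s' \mid s,a)\, q(s,a) = (1-\gamma)\, \mu_1(s'), && \forall s'.
\end{align*}
```

In this sketch, a resolving scheme would replace each $\alpha_k$ with a quantity reflecting the resources still remaining, and would plug in estimates of $r$, $c_k$, and $P$ wherever the true values are unknown. An optimal basis would then correspond to a choice of state-action pairs with $q(s,a) > 0$ together with the resource constraints that are binding.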