Constrained Markov decision processes (CMDPs) model sequential decision-making problems with multiple objectives, which are increasingly important in many applications. However, the model is often unknown and must be learned online while still ensuring that the constraint is met, or at least that its violation remains bounded over time. Some recent papers have made progress on this very challenging problem, but they either rely on unsatisfactory assumptions, such as knowledge of a safe policy, or incur high cumulative regret. We propose the Safe PSRL (posterior sampling-based RL) algorithm, which does not need such assumptions and yet performs very well, both in terms of theoretical regret bounds and empirically. The algorithm achieves an efficient tradeoff between exploration and exploitation through the posterior sampling principle, and provably suffers only bounded constraint violation by leveraging the idea of pessimism. Our method follows a primal-dual approach. We establish a sub-linear $\tilde{\mathcal{O}}\left(H^{2.5} \sqrt{|\mathcal{S}|^2 |\mathcal{A}| K} \right)$ upper bound on the Bayesian reward objective regret, along with a bounded, i.e., $\tilde{\mathcal{O}}\left(1\right)$, constraint violation regret over $K$ episodes for a CMDP with $|\mathcal{S}|$ states, $|\mathcal{A}|$ actions, and horizon $H$.
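To make the three ingredients named above (posterior sampling, pessimism, and a primal-dual update) more concrete, the following is a minimal illustrative sketch and not the Safe PSRL algorithm itself. It assumes a tabular CMDP with a Dirichlet posterior over transitions, a constraint of the form "expected episode cost at most `budget`", and a hypothetical `margin` parameter that tightens the budget as a stand-in for pessimism. All function names are hypothetical, and the greedy planner on the scalarized reward is a simplification, since optimal CMDP policies may need to be stochastic.

```python
import random


def sample_dirichlet(pseudo_counts, rng):
    """Draw one probability vector from Dirichlet(pseudo_counts) via gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in pseudo_counts]
    total = sum(draws)
    return [d / total for d in draws]


def sample_model(counts, rng):
    """Posterior sampling step: draw one transition kernel P[s][a][s_next].

    counts[s][a] holds strictly positive Dirichlet pseudo-counts
    (assumed here: prior mass plus observed transition counts).
    """
    return [[sample_dirichlet(row, rng) for row in per_state] for per_state in counts]


def greedy_lagrangian_policy(P, reward, cost, lam, horizon):
    """Backward induction on the scalarized reward r - lam * c in the sampled model."""
    n_states, n_actions = len(P), len(P[0])
    value = [0.0] * n_states
    policy = []
    for _ in range(horizon):
        q = [[reward[s][a] - lam * cost[s][a]
              + sum(p * v for p, v in zip(P[s][a], value))
              for a in range(n_actions)]
             for s in range(n_states)]
        step_policy = [max(range(n_actions), key=lambda a, s=s: q[s][a])
                       for s in range(n_states)]
        value = [q[s][step_policy[s]] for s in range(n_states)]
        policy.append(step_policy)
    policy.reverse()  # after reversing, policy[h][s] is the action at step h
    return policy


def dual_update(lam, episode_cost, budget, margin, step_size):
    """Projected dual ascent against a pessimistically tightened budget."""
    return max(0.0, lam + step_size * (episode_cost - (budget - margin)))
```

In an episodic loop one would, under these assumptions, sample a model, plan with the current multiplier `lam`, run the policy, update the pseudo-counts, and then call `dual_update` with the observed episode cost. A larger `margin` makes the learner more conservative about the constraint at some cost in reward.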