We study optimality for the safety-constrained Markov decision process, which is the underlying framework for safe reinforcement learning. Specifically, we consider a constrained Markov decision process with finitely many states and actions, in which the decision maker must reach a target set while avoiding one or more unsafe sets with given probabilistic guarantees. Because a target set and an unsafe set exist by definition, the Markov chain induced by any control policy is multichain. The decision maker must also be optimal with respect to a cost function while navigating to the target set, which gives rise to a multi-objective optimization problem. We highlight that Bellman's principle of optimality may fail for constrained Markov decision problems with an underlying multichain structure, as shown by the counterexample due to Haviv. We resolve the counterexample by formulating the multi-objective optimization problem as a zero-sum game and then constructing an asynchronous value iteration scheme for the Lagrangian, similar to Shapley's algorithm. Finally, we consider the corresponding reinforcement learning problem and construct a modified $Q$-learning algorithm for learning the Lagrangian from data. We also provide a lower bound on the number of iterations required to learn the Lagrangian, together with corresponding error bounds.
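To make the role of the Lagrangian concrete, the following is a minimal sketch of value iteration on a Lagrangian-penalized cost for a fixed multiplier. It is not the asynchronous scheme or the modified $Q$-learning algorithm of this work. The three-state model, the action names `fast` and `slow`, the costs, the transition probabilities, and the function names are all hypothetical and chosen only for illustration. The sketch assumes the safety constraint is handled by adding a penalty `lam` each time the unsafe state is entered.

```python
# Hypothetical 3-state model: 0 = start, 1 = target, 2 = unsafe.
# Target and unsafe states are absorbing and carry value 0.
START, TARGET, UNSAFE = 0, 1, 2

# ACTIONS[name] = (step cost, [(next_state, probability), ...]) from START.
# All numbers are illustrative assumptions.
ACTIONS = {
    "fast": (1.0, [(TARGET, 0.8), (UNSAFE, 0.2)]),
    "slow": (2.0, [(START, 0.5), (TARGET, 0.45), (UNSAFE, 0.05)]),
}


def q_value(action, v_start, lam):
    """Penalized one-step lookahead: cost + E[lam * 1{unsafe} + V(next)]."""
    cost, transitions = ACTIONS[action]
    total = cost
    for next_state, prob in transitions:
        penalty = lam if next_state == UNSAFE else 0.0
        future = v_start if next_state == START else 0.0
        total += prob * (penalty + future)
    return total


def lagrangian_value_iteration(lam, tol=1e-10, max_iters=10_000):
    """Value iteration on the penalized cost for a fixed multiplier lam >= 0.

    Returns the penalized value of START and the greedy action there.
    """
    v = 0.0
    for _ in range(max_iters):
        new_v = min(q_value(a, v, lam) for a in ACTIONS)
        converged = abs(new_v - v) < tol
        v = new_v
        if converged:
            break
    greedy = min(ACTIONS, key=lambda a: q_value(a, v, lam))
    return v, greedy


if __name__ == "__main__":
    for lam in (0.0, 100.0):
        value, action = lagrangian_value_iteration(lam)
        print(f"lam={lam}: value={value:.4f}, greedy action={action}")
```

In this toy model a small multiplier favors the cheap but risky action, while a large multiplier favors the costlier and safer one. Searching over the multiplier is what a zero-sum game formulation would do; that outer step is omitted here.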