We study optimality for the safety-constrained Markov decision process, which is the underlying framework for safe reinforcement learning. Specifically, we consider a constrained Markov decision process (with finite states and finite actions) in which the goal of the decision maker is to reach a target set while avoiding one or more unsafe sets with certain probabilistic guarantees. Because a target set and an unsafe set exist by definition, the underlying Markov chain for any control policy is multichain. The decision maker must also be optimal (with respect to a cost function) while navigating to the target set, which gives rise to a multi-objective optimization problem. We highlight that Bellman's principle of optimality may not hold for constrained Markov decision problems with an underlying multichain structure, as we show by a counterexample. We resolve the counterexample by formulating the aforementioned multi-objective optimization problem as a zero-sum game, and thereafter construct an asynchronous value iteration scheme for the Lagrangian (similar to Shapley's algorithm). Finally, we consider the corresponding reinforcement learning problem and construct a modified Q-learning algorithm for learning the Lagrangian from data. We also provide a lower bound on the number of iterations required for learning the Lagrangian, together with corresponding error bounds.
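For orientation, the constrained problem and its game form can be sketched in generic notation. All symbols here are assumptions introduced for illustration and are not taken from the paper: $T$ is the target set, $U$ the unsafe set, $c$ the stage cost, $\tau$ the first hitting time of $T \cup U$ (assumed finite almost surely), $s_0$ the initial state, and $\delta$ the tolerated probability of ending in $U$. The paper's own constraint may be stated differently, for example as a lower bound on the probability of reaching $T$.

```latex
\begin{align}
  \min_{\pi}\;
    & \mathbb{E}^{\pi}_{s_0}\!\Big[\sum_{t=0}^{\tau-1} c(s_t, a_t)\Big]
    \quad \text{subject to} \quad
    \mathbb{P}^{\pi}_{s_0}\big(s_\tau \in U\big) \le \delta, \\
  L(\pi, \lambda)
    &= \mathbb{E}^{\pi}_{s_0}\!\Big[\sum_{t=0}^{\tau-1} c(s_t, a_t)\Big]
     + \lambda \Big(\mathbb{P}^{\pi}_{s_0}\big(s_\tau \in U\big) - \delta\Big),
    \qquad \lambda \ge 0, \\
  \text{game value}
    &= \min_{\pi}\, \max_{\lambda \ge 0}\, L(\pi, \lambda).
\end{align}
```

In this reading the policy player minimizes and the multiplier player maximizes, which is the sense in which the multi-objective problem becomes a zero-sum game; the value iteration and Q-learning schemes mentioned above would then operate on a Lagrangian of this kind.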