We study optimality for the safety-constrained Markov decision process, which is the underlying framework for safe reinforcement learning. Specifically, we consider a constrained Markov decision process with finite state and action spaces, where the goal of the decision maker is to reach a target set while avoiding one or more unsafe sets with certain probabilistic guarantees. Because both a target set and an unsafe set exist by definition, the Markov chain induced by any control policy is multichain. The decision maker must also be optimal with respect to a cost function while navigating to the target set, which gives rise to a multi-objective optimization problem. We highlight, by means of a counterexample, that Bellman's principle of optimality may not hold for constrained Markov decision problems with an underlying multichain structure. We resolve the counterexample by formulating the multi-objective optimization problem as a zero-sum game, and then construct an asynchronous value iteration scheme for the Lagrangian (similar to Shapley's algorithm). Finally, we consider the reinforcement learning problem for the same setting and construct a modified Q-learning algorithm for learning the Lagrangian from data. We also provide a lower bound on the number of iterations required for learning the Lagrangian, together with the corresponding error bounds.
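To make the Lagrangian idea concrete, the following is a minimal illustrative sketch, not the algorithm of this paper. It assumes a fixed multiplier `lam`, absorbing target and unsafe sets, and that every policy reaches one of them with probability one. Under those assumptions, charging a terminal penalty of `lam` on the unsafe set turns the constrained problem into an unconstrained one whose value is the expected cost plus `lam` times the probability of hitting the unsafe set. The function name, array layout, and in-place (Gauss-Seidel style) sweep order are all hypothetical choices made for illustration.

```python
import numpy as np

def lagrangian_value_iteration(P, cost, target, unsafe, lam,
                               sweeps=1000, tol=1e-10):
    """In-place value iteration on the Lagrangian for a fixed multiplier.

    Assumed (hypothetical) layout:
      P[a, s, t]  probability of moving from s to t under action a
      cost[s, a]  stage cost of taking action a in state s
      target, unsafe  sets of absorbing state indices
      lam         multiplier >= 0, charged once on reaching an unsafe state
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    V[list(unsafe)] = lam  # terminal penalty; target states keep value 0
    transient = [s for s in range(n_states)
                 if s not in target and s not in unsafe]
    for _ in range(sweeps):
        delta = 0.0
        for s in transient:
            # One entry per action: stage cost plus expected value of successor.
            q = cost[s] + P[:, s, :] @ V
            new_value = q.min()
            delta = max(delta, abs(new_value - V[s]))
            V[s] = new_value  # updated in place, so later states see it at once
        if delta < tol:
            break
    return V

# Toy example (assumed data): state 0 is the start, 1 the target, 2 unsafe.
P = np.zeros((2, 3, 3))
P[0, 0, 1] = 1.0                    # action 0: safe, always reaches the target
P[1, 0, 1], P[1, 0, 2] = 0.9, 0.1   # action 1: cheaper, 10% chance of unsafe
P[:, 1, 1] = 1.0                    # target is absorbing
P[:, 2, 2] = 1.0                    # unsafe is absorbing
cost = np.array([[2.0, 1.0], [0.0, 0.0], [0.0, 0.0]])

V_low = lagrangian_value_iteration(P, cost, {1}, {2}, lam=0.0)
V_high = lagrangian_value_iteration(P, cost, {1}, {2}, lam=20.0)
```

In this toy problem the risky action is preferred when `lam` is small and the safe action when `lam` is large, with the switch at `lam = 10`. In a zero-sum formulation, the second player would choose `lam` to maximize `V[start] - lam * delta`, where `delta` is the allowed probability of reaching the unsafe set (again a hypothetical symbol here).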