In this work we address the problem of finding feasible policies for Constrained Markov Decision Processes under probability-one constraints. We argue that stationary policies are not sufficient for solving this problem, and that a rich class of policies can be found by endowing the controller with a scalar quantity, a so-called budget, that tracks how close the agent is to violating the constraint. We show that the minimal budget required to act safely can be obtained as the smallest fixed point of a Bellman-like operator, whose convergence properties we analyze. We also show how to learn this quantity when the true kernel of the Markov decision process is not known, and we provide sample-complexity bounds. The utility of knowing this minimal budget lies in the fact that it can aid the search for optimal or near-optimal policies by shrinking the region of the state space the agent must navigate. Simulations illustrate how probability-one constraints differ in nature from the constraints in expectation that are typically used.
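To make the fixed-point idea concrete, the following is a minimal sketch on a toy problem, not the operator defined in the paper. It assumes a finite MDP with a nonnegative per-step constraint cost, no discounting, and a probability-one requirement that the accumulated cost never exceeds the budget. Under those assumptions the recursion takes the worst case over every successor state that has positive probability, rather than an expectation. The names `minimal_budget`, `cost`, and `support` are hypothetical and introduced only for illustration.

```python
import numpy as np


def minimal_budget(cost, support, max_iters=1000, tol=1e-12):
    """Iterate B(s) = min_a [ cost(s, a) + max_{s' reachable} B(s') ] from B = 0.

    cost:    (S, A) array of nonnegative constraint costs (assumed).
    support: (S, A, S) boolean array, True where P(s' | s, a) > 0 (assumed).
             Every (s, a) pair must have at least one reachable successor.
    """
    n_states = cost.shape[0]
    budget = np.zeros(n_states)  # starting at zero targets the smallest fixed point
    for _ in range(max_iters):
        # Worst reachable successor: only the support of the kernel matters,
        # not the actual transition probabilities.
        worst_next = np.where(support, budget[None, None, :], -np.inf).max(axis=2)
        new_budget = (cost + worst_next).min(axis=1)
        if np.max(np.abs(new_budget - budget)) <= tol:
            return new_budget
        budget = new_budget
    return budget  # may not have converged (e.g. states with unbounded budget)


# Toy example with 3 states and 2 actions; state 2 is absorbing and free.
cost = np.array([[0.0, 1.0],   # state 0: action 0 is free, action 1 costs 1
                 [2.0, 2.0],   # state 1: every action costs 2
                 [0.0, 0.0]])  # state 2: no cost
support = np.zeros((3, 2, 3), dtype=bool)
support[0, 0, [1, 2]] = True   # action 0 in state 0 may land in state 1 or 2
support[0, 1, 2] = True        # action 1 in state 0 always goes to state 2
support[1, :, 2] = True        # state 1 always moves to state 2
support[2, :, 2] = True        # state 2 is absorbing

result = minimal_budget(cost, support)
print(result)
```

In this toy example the minimal budget at state 0 is 1: the free action can reach the costly state 1, so acting safely with probability one requires paying for the other action. A constraint in expectation could still favor the free action if state 1 were unlikely enough, which is the kind of difference the abstract refers to.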