In many problems, it is desirable to optimize an objective function while imposing constraints on some other aspect of the problem. A Constrained Partially Observable Markov Decision Process (C-POMDP) models such problems under transition uncertainty and partial observability. Typically, the constraints in C-POMDPs enforce a threshold on expected cumulative costs starting from an initial state distribution. In this work, we first show that optimal C-POMDP policies may violate Bellman's principle of optimality and thus may exhibit pathological behaviors, which can be undesirable for many applications. To address this drawback, we introduce a new formulation, the Recursively-Constrained POMDP (RC-POMDP), that imposes additional history-dependent cost constraints on the C-POMDP. We show that, unlike C-POMDPs, RC-POMDPs always have deterministic optimal policies, and that these optimal policies obey Bellman's principle of optimality. We also present a point-based dynamic programming algorithm that synthesizes optimal policies for RC-POMDPs. In our evaluations, we show that policies for RC-POMDPs produce more desirable behavior than policies for C-POMDPs, and we demonstrate the efficacy of our algorithm across a set of benchmark problems.
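
For reference, the expected-cumulative-cost constraint mentioned above is commonly written in the following form. The notation is an assumption made for illustration and does not come from the abstract: $\pi$ is a policy, $b_0$ the initial state distribution, $\gamma$ a discount factor, $R$ and $C$ the reward and cost functions, and $\hat{c}$ the cost threshold (a single cost is shown for simplicity).

```latex
\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\middle|\, b_0\right]
\quad \text{subject to} \quad
\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} C(s_t, a_t) \,\middle|\, b_0\right] \le \hat{c}
```

In this form the constraint is stated only at $b_0$, so it bounds the cost in expectation over all histories rather than along each individual history. The RC-POMDP adds history-dependent cost constraints on top of it.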