In many problems, it is desirable to optimize an objective function while imposing constraints on some other objectives. A Constrained Partially Observable Markov Decision Process (C-POMDP) allows modeling of such problems under transition uncertainty and partial observability. Typically, the constraints in C-POMDPs enforce a threshold on expected cumulative costs starting from an initial state distribution. In this work, we first show that optimal C-POMDP policies may violate Bellman's principle of optimality and thus may exhibit unintuitive behaviors, which can be undesirable for some (e.g., safety-critical) applications. Additionally, online re-planning with C-POMDPs is often ineffective because of the inconsistency that results from this violation. To address these drawbacks, we introduce a new formulation, the Recursively-Constrained POMDP (RC-POMDP), which imposes additional history-dependent cost constraints on the C-POMDP. We show that, unlike C-POMDPs, RC-POMDPs always have deterministic optimal policies, and that optimal policies obey Bellman's principle of optimality. We also present a point-based dynamic programming algorithm that synthesizes admissible near-optimal policies for RC-POMDPs. Evaluations on a set of benchmark problems demonstrate the efficacy of our algorithm and show that policies for RC-POMDPs produce more desirable behaviors than policies for C-POMDPs.