Robust POMDPs extend classical POMDPs to handle model uncertainty. Specifically, robust POMDPs feature so-called uncertainty sets on the transition and observation models, effectively defining ranges of probabilities. Policies for robust POMDPs must be (1) memory-based to account for partial observability and (2) robust against model uncertainty to account for the worst-case instances from the uncertainty sets. To compute such robust memory-based policies, we propose the pessimistic iterative planning (PIP) framework, which alternates between two main steps: (1) selecting a pessimistic (non-robust) POMDP via worst-case probability instances from the uncertainty sets, and (2) computing a finite-state controller (FSC) for this pessimistic POMDP. We evaluate the performance of this FSC on the original robust POMDP and use this evaluation in step (1) to select the next pessimistic POMDP. Within PIP, we propose the rFSCNet algorithm. In each iteration, rFSCNet finds an FSC through a recurrent neural network, supervised by policies optimized for the pessimistic POMDP. The empirical evaluation in four benchmark environments showcases improved robustness against several baseline methods and competitive performance compared to a state-of-the-art robust POMDP solver.
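The alternation between the two PIP steps can be illustrated with a deliberately tiny sketch. Everything below is a toy stand-in, not the authors' actual method: the "uncertainty set" is a finite set of success probabilities, the "FSC" is a single symbolic choice, and exhaustive search replaces the RNN-based rFSCNet; `toy_value`, `robust_value`, and `pip` are hypothetical names introduced only for this illustration.

```python
# Toy sketch of the PIP alternation, assuming a one-parameter uncertainty
# set (candidate success probabilities) and two candidate "controllers".

def toy_value(fsc, p):
    """Stand-in value function: 'commit' pays off with probability p,
    while 'hedge' guarantees a fixed value of 0.5."""
    return p if fsc == "commit" else 0.5

def robust_value(fsc, uncertainty_set):
    """Worst-case value of a controller over the uncertainty set."""
    return min(toy_value(fsc, p) for p in uncertainty_set)

def pip(uncertainty_set, n_iters=5):
    fsc = "commit"  # arbitrary initial controller
    for _ in range(n_iters):
        # Step 1: pick the pessimistic instance, i.e. the probability
        # that is worst for the current controller.
        p_worst = min(uncertainty_set, key=lambda p: toy_value(fsc, p))
        # Step 2: compute the best controller for that pessimistic POMDP
        # (here by brute force; rFSCNet would train an RNN instead).
        fsc = max(["commit", "hedge"], key=lambda c: toy_value(c, p_worst))
    return fsc, robust_value(fsc, uncertainty_set)

fsc, val = pip([0.3, 0.7])
print(fsc, val)  # worst case is p = 0.3, so "hedge" (0.5) beats "commit" (0.3)
```

Running the sketch, the loop settles on the robustly better controller: against the worst-case instance `p = 0.3`, hedging (value 0.5) outperforms committing (value 0.3), mirroring how PIP uses worst-case instances to steer policy computation.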