Robust partially observable Markov decision processes (robust POMDPs) extend classical POMDPs to handle additional uncertainty in the transition and observation probabilities via so-called uncertainty sets. Policies for robust POMDPs must not only be memory-based to account for partial observability but also robust against model uncertainty to account for the worst-case instances from the uncertainty sets. We propose the pessimistic iterative planning (PIP) framework, which finds robust memory-based policies for robust POMDPs. PIP alternates between two main steps: (1) selecting an adversarial (non-robust) POMDP via worst-case probability instances from the uncertainty sets; and (2) computing a finite-state controller (FSC) for this adversarial POMDP. We evaluate the performance of this FSC on the original robust POMDP and use this evaluation in step (1) to select the next adversarial POMDP. Within PIP, we propose the rFSCNet algorithm. In each iteration, rFSCNet finds an FSC through a recurrent neural network trained using supervision policies optimized for the adversarial POMDP. An empirical evaluation in four benchmark environments shows improved robustness over a baseline method in an ablation study, as well as performance competitive with a state-of-the-art robust POMDP solver.
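The alternation between steps (1) and (2) can be illustrated with a minimal sketch. All names below (`select_adversarial_pomdp`, `compute_fsc`, `toy_value`, and the scalar stand-ins for POMDP instances) are hypothetical placeholders for illustration, not the paper's actual algorithm or API; the sketch assumes a finite uncertainty set and picks, in each round, the instance on which the last FSC performed worst.

```python
def select_adversarial_pomdp(uncertainty_set, evaluation):
    # Step (1): pick the worst-case instance for the last FSC
    # (instances not yet evaluated default to value 0.0).
    return min(uncertainty_set, key=lambda inst: evaluation.get(inst, 0.0))

def compute_fsc(pomdp_instance):
    # Step (2): compute an FSC for the adversarial POMDP.
    # Placeholder: just record which instance it was optimized for.
    return {"trained_on": pomdp_instance}

def toy_value(fsc, inst):
    # Hypothetical value function used only to make the sketch runnable:
    # an FSC does slightly worse on the instance it overfits to.
    return inst - (0.1 if fsc["trained_on"] == inst else 0.0)

def evaluate_on_robust_pomdp(fsc, uncertainty_set):
    # Evaluate the FSC on every instance; the robust value is the
    # worst case over the uncertainty set.
    values = {inst: toy_value(fsc, inst) for inst in uncertainty_set}
    return values, min(values.values())

def pessimistic_iterative_planning(uncertainty_set, iterations=5):
    evaluation = {}
    best_fsc, best_robust_value = None, float("-inf")
    for _ in range(iterations):
        adversarial = select_adversarial_pomdp(uncertainty_set, evaluation)
        fsc = compute_fsc(adversarial)
        evaluation, robust_value = evaluate_on_robust_pomdp(fsc, uncertainty_set)
        if robust_value > best_robust_value:
            best_fsc, best_robust_value = fsc, robust_value
    return best_fsc, best_robust_value
```

The key design point the loop captures: the FSC is always scored against the full uncertainty set (the robust evaluation), and that score, not the adversarial POMDP's own value, drives the selection of the next worst-case instance.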