Robust partially observable Markov decision processes (robust POMDPs) extend classical POMDPs by using so-called uncertainty sets to capture additional uncertainty in the transition and observation probabilities. Policies for robust POMDPs must not only be memory-based to account for partial observability but also robust against model uncertainty to account for the worst-case instances from the uncertainty sets. We propose the pessimistic iterative planning (PIP) framework, which finds robust memory-based policies for robust POMDPs. PIP alternates between two main steps: (1) selecting an adversarial (non-robust) POMDP via worst-case probability instances from the uncertainty sets; and (2) computing a finite-state controller (FSC) for this adversarial POMDP. We evaluate the performance of this FSC on the original robust POMDP and use this evaluation in step (1) to select the next adversarial POMDP. Within PIP, we propose the rFSCNet algorithm. In each iteration, rFSCNet finds an FSC through a recurrent neural network by using supervision policies optimized for the adversarial POMDP. The empirical evaluation in four benchmark environments shows improved robustness compared to several baseline methods and competitive performance compared to a state-of-the-art robust POMDP solver.
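The alternation between the two PIP steps can be illustrated with a minimal sketch. It is not the rFSCNet implementation: the uncertainty set is assumed to be a small finite list of model names, the FSC computation is replaced by a lookup over two hypothetical candidate policies, and the value table `VALUES` is invented purely for illustration. The names `solve`, `evaluate`, and `pessimistic_iterative_planning` are likewise assumptions of this sketch.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical toy data: VALUES[policy][model] is the expected return of a
# candidate policy (standing in for an FSC) under one model instance
# (standing in for a POMDP drawn from the uncertainty set).
VALUES = {
    "cautious": {"nominal": 5.0, "noisy": 4.0},
    "greedy": {"nominal": 8.0, "noisy": 1.0},
}
MODELS = ["nominal", "noisy"]


def solve(model: str) -> str:
    """Stand-in for step (2): best candidate policy for one fixed model."""
    return max(VALUES, key=lambda policy: VALUES[policy][model])


def evaluate(policy: str, model: str) -> float:
    """Stand-in for policy evaluation under one model instance."""
    return VALUES[policy][model]


def pessimistic_iterative_planning(
    models: Sequence[str],
    solve: Callable[[str], str],
    evaluate: Callable[[str, str], float],
    max_iters: int = 10,
) -> Tuple[Optional[str], float]:
    """Alternate between policy computation and worst-case model selection."""
    adversarial = models[0]  # assumption: start from the first instance
    best_policy, best_worst = None, float("-inf")
    for _ in range(max_iters):
        # Step (2): compute a policy for the current adversarial model.
        policy = solve(adversarial)
        # Robust evaluation: find the instance that is worst for this policy.
        worst_model = min(models, key=lambda m: evaluate(policy, m))
        worst_value = evaluate(policy, worst_model)
        if worst_value > best_worst:
            best_policy, best_worst = policy, worst_value
        if worst_model == adversarial:
            break  # the adversary no longer changes, so stop iterating
        # Step (1): the worst-case instance becomes the next adversarial model.
        adversarial = worst_model
    return best_policy, best_worst


result = pessimistic_iterative_planning(MODELS, solve, evaluate)
```

On this toy table, the first iteration picks the policy that is best for the nominal model, finds that it performs poorly under the other instance, and the second iteration switches to the policy with the better worst-case value. Keeping the best worst-case value seen so far is a design choice of this sketch, made so that the returned policy is never worse in the robust sense than any earlier iterate.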