Multiple-environment MDPs (MEMDPs) capture finite sets of MDPs that share the states but differ in the transition dynamics. These models form a proper subclass of partially observable MDPs (POMDPs). We consider the synthesis of policies that robustly satisfy an almost-sure reachability property in MEMDPs, that is, one policy that satisfies a property for all environments. For POMDPs, deciding the existence of robust policies is an EXPTIME-complete problem. In this paper, we show that this problem is PSPACE-complete for MEMDPs, while the policies in general require exponential memory. We exploit the theoretical results to develop and implement an algorithm that shows promising results in synthesizing robust policies for various benchmarks.
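To make the robustness requirement concrete, the following minimal sketch checks whether one fixed memoryless policy reaches a target set almost surely in every environment of a small MEMDP. It only illustrates the definition and is not the algorithm developed in the paper: as noted above, robust policies in general require exponential memory, so a search over memoryless policies is incomplete. The encoding (a dictionary from state-action pairs to successor distributions) and the names `reachable`, `almost_sure_reach`, and `is_robust` are assumptions made for this example.

```python
from collections import deque

def reachable(succ, start):
    """Return the set of states reachable from `start` via the successor function `succ`."""
    seen, queue = {start}, deque([start])
    while queue:
        s = queue.popleft()
        for t in succ(s):
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return seen

def almost_sure_reach(env, policy, init, targets):
    """Check that the Markov chain induced by `policy` in `env` reaches `targets` with probability 1."""
    def succ(s):
        # Target states are treated as absorbing.
        if s in targets:
            return []
        return [t for t, p in env[(s, policy[s])].items() if p > 0]
    # In a finite chain, the target is reached almost surely iff every state
    # reachable from `init` can itself still reach a target state.
    return all(reachable(succ, s) & targets for s in reachable(succ, init))

def is_robust(envs, policy, init, targets):
    """A policy is robust if it satisfies the property in all environments."""
    return all(almost_sure_reach(env, policy, init, targets) for env in envs)

# Hypothetical MEMDP: both environments share the states but differ in the dynamics.
sink_loops = {("sink", "a"): {"sink": 1.0}, ("sink", "b"): {"sink": 1.0}}
env0 = {("s0", "a"): {"goal": 0.5, "s0": 0.5}, ("s0", "b"): {"sink": 1.0}, **sink_loops}
env1 = {("s0", "a"): {"goal": 0.1, "s0": 0.9}, ("s0", "b"): {"goal": 1.0}, **sink_loops}
envs = [env0, env1]

always_a = {"s0": "a", "sink": "a"}
always_b = {"s0": "b", "sink": "a"}

print(is_robust(envs, always_a, "s0", {"goal"}))  # True: goal is reached a.s. in both environments
print(is_robust(envs, always_b, "s0", {"goal"}))  # False: action b leads to the sink in env0
```

The check is purely graph-based: for almost-sure properties only the support of each distribution matters, not the exact probabilities, which is why the two environments can disagree on the probability of reaching `goal` under action `a` and still both satisfy the property.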