This paper marries two state-of-the-art controller synthesis methods for partially observable Markov decision processes (POMDPs), a prominent model in sequential decision making under uncertainty. A central issue is to find a POMDP controller, one that decides solely on the basis of the observations seen so far, that achieves a total expected reward objective. As finding optimal controllers is undecidable, we concentrate on synthesising good finite-state controllers (FSCs). We do so by tightly integrating two modern, orthogonal methods for POMDP controller synthesis: a belief-based and an inductive approach. The former obtains an FSC from a finite fragment of the so-called belief MDP, an MDP that keeps track of the probabilities of equally observable POMDP states. The latter is an inductive search technique over a set of FSCs, e.g., controllers with a fixed memory size. The key result of this paper is a symbiotic anytime algorithm that integrates both approaches so that each profits from the controllers constructed by the other. Experimental results indicate a substantial improvement in the value of the controllers, together with a significant reduction in synthesis time and memory footprint.
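To make the notion of a belief MDP state concrete, the following is a minimal sketch of a single belief update, not the paper's implementation. It assumes that each POMDP state emits one deterministic observation, so a belief is a probability distribution over states that share an observation. The function `belief_update`, the dictionary encodings, and the three-state toy model are hypothetical names and data chosen for illustration only.

```python
from collections import defaultdict


def belief_update(belief, action, observation, transitions, obs_of):
    """Return the successor belief after taking `action` and seeing `observation`.

    belief:      dict mapping state -> probability
    transitions: dict mapping (state, action) -> dict of next_state -> probability
    obs_of:      dict mapping state -> observation (deterministic; an assumption)
    """
    unnormalised = defaultdict(float)
    for state, prob in belief.items():
        for next_state, t_prob in transitions[(state, action)].items():
            # Keep only successors that are consistent with what was observed.
            if obs_of[next_state] == observation:
                unnormalised[next_state] += prob * t_prob
    total = sum(unnormalised.values())
    if total == 0.0:
        raise ValueError("observation has probability zero under this belief")
    # Normalise so that the successor belief sums to one.
    return {s: p / total for s, p in unnormalised.items()}


# Hypothetical toy model: s1 and s2 are indistinguishable (both emit "y").
obs_of = {"s0": "x", "s1": "y", "s2": "y"}
transitions = {("s0", "a"): {"s1": 0.25, "s2": 0.75}}

successor = belief_update({"s0": 1.0}, "a", "y", transitions, obs_of)
```

In this toy model the successor belief spreads its mass over `s1` and `s2`, the two states that share the observation `"y"`. Repeating such updates from an initial belief enumerates reachable belief MDP states; a belief-based method would explore only a finite fragment of them.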