This paper marries two state-of-the-art controller synthesis methods for partially observable Markov decision processes (POMDPs), a prominent model in sequential decision making under uncertainty. A central issue is to find a POMDP controller, one that decides solely based on the observations seen so far, that achieves a total expected reward objective. As finding optimal controllers is undecidable, we concentrate on synthesizing good finite-state controllers (FSCs). We do so by tightly integrating two modern, orthogonal methods for POMDP controller synthesis: a belief-based and an inductive approach. The former obtains an FSC from a finite fragment of the so-called belief MDP, an MDP that keeps track of the probabilities of equally observable POMDP states. The latter is an inductive search technique over a set of FSCs, e.g., controllers with a fixed memory size. The key result of this paper is a symbiotic anytime algorithm that tightly integrates both approaches such that each profits from the controllers constructed by the other. Experimental results indicate a substantial improvement in the value of the controllers while significantly reducing the synthesis time and memory footprint.
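To make the notion of a belief MDP concrete, the following is a minimal sketch of a single belief-MDP step, assuming each POMDP state emits one deterministic observation. It is not the paper's algorithm. The function `belief_successors`, the dictionaries `TRANSITIONS` and `OBSERVATION_OF`, and the three-state example are all hypothetical names and data introduced only for illustration.

```python
from collections import defaultdict


def belief_successors(belief, action, transitions, observation_of):
    """Return {observation: (probability, next_belief)} for one action.

    belief:         dict mapping state -> probability (sums to 1)
    transitions:    dict mapping state -> action -> {next_state: probability}
    observation_of: dict mapping state -> observation (assumed deterministic)
    """
    # Push the current belief through the transition function.
    mass = defaultdict(float)
    for state, p_state in belief.items():
        for next_state, p_trans in transitions[state][action].items():
            mass[next_state] += p_state * p_trans

    # Group successor states by the observation they emit, so each
    # successor belief only covers equally observable states.
    grouped = defaultdict(dict)
    for next_state, p in mass.items():
        grouped[observation_of[next_state]][next_state] = p

    # Normalize each group into a probability distribution.
    result = {}
    for obs, dist in grouped.items():
        total = sum(dist.values())
        result[obs] = (total, {s: p / total for s, p in dist.items()})
    return result


# Hypothetical three-state POMDP: s1 and s2 share the observation "fog".
TRANSITIONS = {
    "s0": {"go": {"s1": 0.25, "s2": 0.75}},
    "s1": {"go": {"s1": 1.0}},
    "s2": {"go": {"s2": 1.0}},
}
OBSERVATION_OF = {"s0": "start", "s1": "fog", "s2": "fog"}

if __name__ == "__main__":
    print(belief_successors({"s0": 1.0}, "go", TRANSITIONS, OBSERVATION_OF))
```

Each reachable belief becomes a state of the belief MDP. Since infinitely many beliefs may be reachable, a belief-based method of the kind described above would explore only a finite fragment of them.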