Solving partially observable Markov decision processes (POMDPs) requires computing policies under imperfect state information. Despite recent advances, the scalability of existing POMDP solvers remains limited. Moreover, many settings require a policy that is robust across multiple POMDPs, further aggravating the scalability issue. We propose the Lexpop framework for POMDP solving. Lexpop (1) employs deep reinforcement learning to train a neural policy, represented by a recurrent neural network, and (2) constructs a finite-state controller that mimics the neural policy through efficient extraction methods. Crucially, unlike neural policies, such controllers can be formally evaluated, providing performance guarantees. We extend Lexpop to compute robust policies for hidden-model POMDPs (HM-POMDPs), which describe finite sets of POMDPs. We associate every extracted controller with its worst-case POMDP. Using a set of such POMDPs, we iteratively train a robust neural policy and subsequently extract a robust controller. Our experiments show that on problems with large state spaces, Lexpop outperforms state-of-the-art solvers for both POMDPs and HM-POMDPs.
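The iterative scheme outlined above (train on a subset of POMDPs, extract a controller, evaluate it to find its worst-case POMDP, and add that POMDP to the training set) can be sketched in miniature. The sketch below is purely illustrative, not the authors' implementation: POMDPs are reduced to toy numeric parameters, `train_policy` stands in for deep RL on a recurrent policy, `extract_controller` for the extraction step, and the value function is an assumed placeholder for formal controller evaluation.

```python
def train_policy(subset):
    # Placeholder for deep RL training of a recurrent neural policy
    # on the current set of POMDPs; here the "policy" is just the
    # mean of the toy POMDP parameters it was trained on.
    return sum(subset) / len(subset)

def extract_controller(policy):
    # Placeholder for extracting a finite-state controller that
    # mimics the neural policy; in this toy, extraction is lossless.
    return policy

def value(controller, pomdp):
    # Assumed toy stand-in for formal evaluation of the controller
    # on a single POMDP: performance degrades with "distance".
    return -abs(controller - pomdp)

def worst_case(controller, family):
    # The worst-case POMDP is the one minimizing the controller's value.
    return min(family, key=lambda p: value(controller, p))

def robust_loop(family):
    # Iteratively grow the training subset with worst-case POMDPs
    # until the worst case is already covered.
    subset = [family[0]]
    for _ in range(len(family)):
        controller = extract_controller(train_policy(subset))
        wc = worst_case(controller, family)
        if wc in subset:
            return controller, subset
        subset.append(wc)
    return controller, subset

if __name__ == "__main__":
    family = [0.0, 1.0, 4.0, 9.0]
    controller, subset = robust_loop(family)
    print(controller, subset)
```

On this toy family the loop terminates after adding a single worst-case POMDP, mirroring the idea that a robust policy need only be trained against a small adversarially chosen subset of the full family.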