Learning in POMDPs is known to be significantly harder than in MDPs. In this paper, we consider the online learning problem for episodic POMDPs with unknown transition and observation models. We propose a Posterior Sampling-based reinforcement learning algorithm for POMDPs (PS4POMDPs), which is simpler and easier to implement than state-of-the-art optimism-based online learning algorithms for POMDPs. We show that the Bayesian regret of the proposed algorithm scales as the square root of the number of episodes, matching the lower bound, and is polynomial in the other parameters. In the general setting, its regret scales exponentially in the horizon length $H$, and we show that this dependence is unavoidable by providing a matching lower bound. However, when the POMDP is undercomplete and weakly revealing (a common assumption in the recent literature), we establish a Bayesian regret bound that is polynomial in $H$. We also propose a posterior sampling algorithm for multi-agent POMDPs and show that it, too, has sublinear regret.
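For concreteness, the Bayesian regret referred to above is commonly defined as follows. This is a sketch in assumed notation: the symbols $K$ (number of episodes), $\nu$ (prior), $\theta$ (true model), $\pi_k$ (policy played in episode $k$), and $V$ (expected episodic return) are introduced here for illustration and are not fixed by the abstract.

$$
\mathrm{BReg}(K) \;=\; \mathbb{E}_{\theta \sim \nu}\!\left[\, \sum_{k=1}^{K} \left( V^{*}_{\theta} - V^{\pi_k}_{\theta} \right) \right],
$$

where $V^{*}_{\theta}$ is the optimal value under model $\theta$ and the expectation is also taken over the randomness of the algorithm and of the observations. Under this reading, "scales as the square root of the number of episodes" means $\mathrm{BReg}(K) = \tilde{O}(\sqrt{K})$, with the remaining problem parameters entering through the leading factor.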