Learning in POMDPs is known to be significantly harder than learning in MDPs. In this paper, we consider the online learning problem for episodic POMDPs with unknown transition and observation models. We propose a Posterior Sampling-based reinforcement learning algorithm for POMDPs (PS4POMDPs), which is considerably simpler and easier to implement than state-of-the-art optimism-based online learning algorithms for POMDPs. We show that the Bayesian regret of the proposed algorithm scales as the square root of the number of episodes and is polynomial in the remaining problem parameters. In the general setting, the regret scales exponentially in the horizon length $H$, and we show that this dependence is unavoidable by providing a lower bound. However, when the POMDP is undercomplete and weakly revealing (a common assumption in the recent literature), we establish a Bayesian regret bound that is polynomial in all parameters. Finally, we propose a posterior sampling algorithm for multi-agent POMDPs and show that it, too, achieves sublinear regret.
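To make the loop structure concrete, the following is a minimal sketch of the generic posterior-sampling recipe the abstract describes: sample a model from the posterior, plan for the sampled model, act for one episode, then update the posterior. Everything beyond that loop is illustrative scaffolding, not the paper's method: the toy tabular sizes, the known rewards, the exhaustive open-loop planner standing in for a genuine POMDP planning oracle, and, most importantly, the state-feedback posterior update (PS4POMDPs conditions its posterior on action-observation histories alone, which is not conjugate and is simplified here only to keep the sketch runnable).

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: S states, A actions, O observations, horizon H, K episodes.
S, A, O, H, K = 3, 2, 2, 4, 50

# Ground-truth model, unknown to the learner.
true_T = rng.dirichlet(np.ones(S), size=(S, A))  # true_T[s, a] = next-state dist.
true_Z = rng.dirichlet(np.ones(O), size=S)       # true_Z[s]   = observation dist.
R = rng.random((S, A))                           # rewards, taken as known here

# Dirichlet posterior over the unknown kernels, kept as pseudo-counts.
T_counts = np.ones((S, A, S))
Z_counts = np.ones((S, O))

def sample_model():
    """Draw one complete POMDP model from the current posterior."""
    T = np.array([[rng.dirichlet(T_counts[s, a]) for a in range(A)]
                  for s in range(S)])
    Z = np.array([rng.dirichlet(Z_counts[s]) for s in range(S)])
    return T, Z

def plan(T):
    """Crude stand-in for the POMDP planning oracle: exhaustively score
    open-loop action sequences under the sampled model and a uniform
    initial belief, and return the best one. (Being open-loop, it
    ignores the sampled observation kernel entirely.)"""
    b0 = np.ones(S) / S
    best_seq, best_val = None, -np.inf
    for seq in itertools.product(range(A), repeat=H):
        b, val = b0, 0.0
        for a in seq:
            val += b @ R[:, a]      # expected reward under current state marginal
            b = b @ T[:, a, :]      # propagate the marginal through sampled T
        if val > best_val:
            best_seq, best_val = seq, val
    return best_seq

for k in range(K):
    T_hat, Z_hat = sample_model()   # 1. sample a model from the posterior
    policy = plan(T_hat)            # 2. plan for the sampled model
    s = rng.integers(S)
    for a in policy:                # 3. act for one episode
        s_next = rng.choice(S, p=true_T[s, a])
        o = rng.choice(O, p=true_Z[s_next])
        # 4. Posterior update. NOTE: counting (s, a, s') pairs assumes the
        # state is revealed after each step, a simplification made only so
        # the posterior stays conjugate and the sketch stays runnable.
        T_counts[s, a, s_next] += 1
        Z_counts[s_next, o] += 1
        s = s_next
```

Only the four-step loop corresponds to what the abstract claims; in particular, replacing the planning oracle and the history-based posterior update with tractable surrogates is exactly where the difficulty of the POMDP setting, and the paper's analysis, lives.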