Partially observable Markov decision processes (POMDPs) have been widely used to model real-world applications. However, existing theoretical results show that learning in general POMDPs can be intractable, with the main challenge being the lack of latent state information. A fundamental question is how much hindsight state information (HSI) is sufficient to achieve tractability. In this paper, we establish a lower bound that reveals a surprising hardness result: without full HSI, obtaining an $\epsilon$-optimal policy for POMDPs requires sample complexity that scales exponentially. Nonetheless, building on the key insights from our lower-bound construction, we find that important tractable classes of POMDPs exist even with partial HSI. In particular, for two novel classes of POMDPs with partial HSI, we provide new algorithms and show that they are near-optimal by establishing new upper and lower bounds.