POMDPs capture a broad class of decision-making problems, but hardness results suggest that learning is intractable even in simple settings because of the inherent partial observability. In many realistic problems, however, additional information is either revealed or can be computed at some point during the learning process. Motivated by diverse applications ranging from robotics to data center scheduling, we formulate a \setting (\setshort) as a POMDP in which the latent states are revealed to the learner in hindsight and only during training. We introduce new algorithms for the tabular and function approximation settings that are provably sample-efficient under hindsight observability, even in POMDPs that would otherwise be statistically intractable. We also give a lower bound showing that the tabular algorithm is optimal in its dependence on the latent state and observation cardinalities.