POMDPs capture a broad class of decision-making problems, but hardness results suggest that learning is intractable even in simple settings due to the inherent partial observability. However, in many realistic problems, additional information is either revealed or can be computed at some point during the learning process. Motivated by diverse applications ranging from robotics to data center scheduling, we formulate the Hindsight Observable Markov Decision Process (HOMDP) as a POMDP in which the latent states are revealed to the learner in hindsight, and only during training. We introduce new algorithms for the tabular and function approximation settings that are provably sample-efficient under hindsight observability, even in POMDPs that would otherwise be statistically intractable. We give a lower bound showing that the tabular algorithm is optimal in its dependence on the latent state and observation cardinalities.
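To make the hindsight-observability setting concrete, the following is a minimal tabular sketch in Python. It is not the algorithm of this paper: it only illustrates the interaction protocol, in which the policy acts on observations alone and the latent state trajectory is returned after the episode ends. It then uses those revealed states for plain count-based estimates of the transition and emission models. All names (`rollout`, `estimate_model`, `normalize`) and the toy problem sizes are assumptions chosen for illustration.

```python
import numpy as np

def rollout(T, O, mu, policy, horizon, rng):
    """Run one episode. The policy sees only observations; the latent
    states are collected and handed back after the episode (hindsight)."""
    s = rng.choice(len(mu), p=mu)
    obs, acts, states = [], [], []
    for _ in range(horizon):
        o = rng.choice(O.shape[1], p=O[s])   # emit an observation from state s
        a = policy(o)                        # act without access to s
        obs.append(o); acts.append(a); states.append(s)
        s = rng.choice(T.shape[2], p=T[s, a])
    return obs, acts, states

def normalize(counts):
    """Turn counts into distributions along the last axis;
    rows with no data fall back to the uniform distribution."""
    totals = counts.sum(axis=-1, keepdims=True)
    uniform = np.full_like(counts, 1.0 / counts.shape[-1])
    return np.where(totals > 0, counts / np.maximum(totals, 1), uniform)

def estimate_model(episodes, n_states, n_actions, n_obs):
    """Count-based estimates from hindsight-revealed latent states
    (an illustrative estimator, assumed here, not taken from the paper)."""
    trans = np.zeros((n_states, n_actions, n_states))
    emit = np.zeros((n_states, n_obs))
    for obs, acts, states in episodes:
        for h, (s, o) in enumerate(zip(states, obs)):
            emit[s, o] += 1
            if h + 1 < len(states):
                trans[s, acts[h], states[h + 1]] += 1
    return normalize(trans), normalize(emit)

# Toy problem: 2 latent states, 2 actions, 3 observations (hypothetical values).
rng = np.random.default_rng(0)
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])     # T[s, a, s']
O = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])              # O[s, o]
mu = np.array([0.5, 0.5])                    # initial state distribution

def random_policy(o):
    return int(rng.integers(2))              # uniform over the 2 actions

episodes = [rollout(T, O, mu, random_policy, 5, rng) for _ in range(2000)]
T_hat, O_hat = estimate_model(episodes, n_states=2, n_actions=2, n_obs=3)
```

With the latent states available in hindsight, both models reduce to supervised frequency estimates, which is the intuition behind why such feedback can remove the statistical hardness of the purely partially observed problem. The sketch says nothing about exploration or the sample-complexity guarantees claimed above.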