In this paper, we study representation learning in partially observable Markov decision processes (POMDPs), where the agent learns a decoder function that maps a sequence of high-dimensional raw observations to a compact representation and uses it for more efficient exploration and planning. We focus on the sub-classes of \textit{$\gamma$-observable} and \textit{decodable POMDPs}, for which statistically tractable learning has been shown to be possible, but for which no computationally efficient algorithm has been available. We first present an algorithm for decodable POMDPs that combines maximum likelihood estimation (MLE) and optimism in the face of uncertainty (OFU) to perform representation learning and achieve sample-efficient learning, while calling only supervised learning computational oracles. We then show how to adapt this algorithm to the broader class of $\gamma$-observable POMDPs.
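For concreteness, the two sub-classes can be sketched as follows. The notation here ($m$, $\phi^\star$, $z_h$, $\mathbb{O}$, $b$) is an assumption on our part, following how these conditions are commonly stated in the literature; it is not taken from the abstract itself. A POMDP is said to be $m$-step decodable if some decoder $\phi^\star$ recovers the latent state from the most recent $m$ observations and the actions between them:
\[
  s_h = \phi^\star(z_h), \qquad z_h = \bigl(o_{h-m+1:h},\, a_{h-m+1:h-1}\bigr).
\]
A POMDP is said to be $\gamma$-observable if distinct beliefs over latent states induce distinguishable observation distributions. Writing $\mathbb{O}(o \mid s)$ for the emission probabilities and $(\mathbb{O}^\top b)(o) = \sum_{s} \mathbb{O}(o \mid s)\, b(s)$, the condition reads
\[
  \bigl\| \mathbb{O}^\top b - \mathbb{O}^\top b' \bigr\|_1 \;\ge\; \gamma\, \bigl\| b - b' \bigr\|_1
  \qquad \text{for all beliefs } b, b' \text{ over latent states}.
\]
Under the first condition a short window of observations determines the state exactly, whereas the second only requires observations to carry a $\gamma$-fraction of the information separating two beliefs, which is consistent with the abstract's description of the $\gamma$-observable class as the broader one.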