Reinforcement learning in partially observed Markov decision processes (POMDPs) faces two challenges. (i) It often takes the full history to predict the future, which induces a sample complexity that scales exponentially with the horizon. (ii) The observation and state spaces are often continuous, which induces a sample complexity that scales exponentially with the extrinsic dimension. Addressing these challenges requires learning a minimal but sufficient representation of the observation and state histories by exploiting the structure of the POMDP. To this end, we propose a reinforcement learning algorithm named Embed to Control (ETC), which learns the representation at two levels while optimizing the policy. (i) For each step, ETC learns to represent the state with a low-dimensional feature, which factorizes the transition kernel. (ii) Across multiple steps, ETC learns to represent the full history with a low-dimensional embedding, which assembles the per-step features. We integrate (i) and (ii) in a unified framework that allows a variety of estimators (including maximum likelihood estimators and generative adversarial networks). For a class of POMDPs with a low-rank structure in the transition kernel, ETC attains an $O(1/\epsilon^2)$ sample complexity that scales polynomially with the horizon and the intrinsic dimension (that is, the rank). Here $\epsilon$ is the optimality gap. To the best of our knowledge, ETC is the first sample-efficient algorithm that bridges representation learning and policy optimization in POMDPs with infinite observation and state spaces.