Exploring and Learning in Sparse Linear MDPs without Computationally Intractable Oracles

The key assumption underlying linear Markov Decision Processes (MDPs) is that the learner has access to a known feature map $\phi(x, a)$ that maps state-action pairs to $d$-dimensional vectors, and that the rewards and transitions are linear functions in this representation. But where do these features come from? In the absence of expert domain knowledge, a tempting strategy is to use the ``kitchen sink'' approach and hope that the true features are included in a much larger set of potential features. In this paper, we revisit linear MDPs from the perspective of feature selection. In a $k$-sparse linear MDP, there is an unknown subset $S \subset [d]$ of size $k$ containing all the relevant features, and the goal is to learn a near-optimal policy in only $\mathrm{poly}(k, \log d)$ interactions with the environment. Our main result is the first polynomial-time algorithm for this problem. In contrast, earlier works either made prohibitively strong assumptions that obviated the need for exploration, or required solving computationally intractable optimization problems. Along the way, we introduce the notion of an emulator: a succinct approximate representation of the transitions that suffices for computing certain Bellman backups. Since linear MDPs are a non-parametric model, it is not even obvious whether polynomial-sized emulators exist. We show that they do exist and can be computed efficiently via convex programming. As a corollary of our main result, we give an algorithm for learning a near-optimal policy in block MDPs whose decoding function is a low-depth decision tree; the algorithm runs in quasi-polynomial time and takes a polynomial number of samples. This can be seen as a reinforcement learning analogue of classic results in computational learning theory. Furthermore, it gives a natural model where improving the sample complexity via representation learning is computationally feasible.
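
For concreteness, the sparsity condition can be written out as follows. This is only a sketch using the standard linear MDP formulation: the symbols $\mu$ (a vector of $d$ signed measures over next states) and $\theta$ (a reward parameter in $\mathbb{R}^d$) do not appear in the abstract and are assumed notation, and the time-step index is suppressed.

$$
P(x' \mid x, a) \;=\; \langle \phi(x,a), \mu(x') \rangle \;=\; \sum_{i \in S} \phi_i(x,a)\,\mu_i(x'),
\qquad
r(x,a) \;=\; \langle \phi(x,a), \theta \rangle \;=\; \sum_{i \in S} \phi_i(x,a)\,\theta_i,
\qquad |S| = k.
$$

Under this reading, $\mu_i \equiv 0$ and $\theta_i = 0$ for every $i \notin S$, so only the $k$ coordinates in $S$ affect the transitions and rewards, even though the learner observes all $d$ coordinates of $\phi$ and does not know $S$ in advance.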
