Exploring and Learning in Sparse Linear MDPs without Computationally Intractable Oracles

The key assumption underlying linear Markov Decision Processes (MDPs) is that the learner has access to a known feature map $\phi(x, a)$ that maps state-action pairs to $d$-dimensional vectors, and that the rewards and transitions are linear functions in this representation. But where do these features come from? In the absence of expert domain knowledge, a tempting strategy is to use the ``kitchen sink'' approach and hope that the true features are included in a much larger set of potential features. In this paper we revisit linear MDPs from the perspective of feature selection. In a $k$-sparse linear MDP, there is an unknown subset $S \subset [d]$ of size $k$ containing all the relevant features, and the goal is to learn a near-optimal policy in only $\mathrm{poly}(k, \log d)$ interactions with the environment. Our main result is the first polynomial-time algorithm for this problem. In contrast, earlier works either made prohibitively strong assumptions that obviated the need for exploration, or required solving computationally intractable optimization problems. Along the way we introduce the notion of an emulator: a succinct approximate representation of the transitions that suffices for computing certain Bellman backups. Since linear MDPs are a non-parametric model, it is not even obvious whether polynomial-sized emulators exist. We show that they do exist and can be computed efficiently via convex programming. As a corollary of our main result, we give an algorithm for learning a near-optimal policy in block MDPs whose decoding function is a low-depth decision tree; the algorithm runs in quasi-polynomial time and takes a polynomial number of samples. This can be seen as a reinforcement learning analogue of classic results in computational learning theory. Furthermore, it gives a natural model where improving the sample complexity via representation learning is computationally feasible.
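To make the $k$-sparse structure concrete, here is a minimal sketch of a toy instance. It is not the paper's algorithm, and it rests on assumptions the abstract does not state: a finite state space, and the standard linear MDP factorization $P(x' \mid x, a) = \langle \phi(x, a), \mu(x') \rangle$, $r(x, a) = \langle \phi(x, a), \theta \rangle$ with $\mu$ and $\theta$ supported on $S$. The names `make_sparse_linear_mdp` and `bellman_backup` are hypothetical and chosen only for illustration.

```python
import numpy as np


def make_sparse_linear_mdp(n_states=6, n_actions=2, d=50, k=3, seed=0):
    """Build a toy k-sparse linear MDP (illustrative construction, an assumption).

    Only the k coordinates in S influence rewards and transitions;
    the remaining d - k coordinates of phi are irrelevant noise.
    """
    rng = np.random.default_rng(seed)
    S = np.sort(rng.choice(d, size=k, replace=False))  # unknown relevant subset

    # Feature map phi(x, a) in R^d. Irrelevant coordinates are arbitrary noise.
    phi = rng.uniform(-1.0, 1.0, size=(n_states, n_actions, d))
    # Relevant coordinates form a probability vector over k latent factors,
    # so the induced transitions are valid distributions.
    phi[:, :, S] = rng.dirichlet(np.ones(k), size=(n_states, n_actions))

    # mu[i] is a distribution over next states for i in S and zero elsewhere.
    mu = np.zeros((d, n_states))
    mu[S] = rng.dirichlet(np.ones(n_states), size=k)

    # Reward weights, also supported on S.
    theta = np.zeros(d)
    theta[S] = rng.uniform(0.0, 1.0, size=k)
    return phi, mu, theta, S


def bellman_backup(phi, mu, theta, V):
    """Return Q(x, a) = r(x, a) + E[V(x') | x, a] and its weight vector w.

    Because rewards and transitions are linear in phi, the backup equals
    <phi(x, a), w> with w = theta + mu @ V, and w is supported on S.
    """
    w = theta + mu @ V   # shape (d,)
    Q = phi @ w          # shape (n_states, n_actions)
    return Q, w
```

In this toy model, `phi @ mu` recovers the full transition tensor, and the backup of any value function `V` is a linear function of only the $k$ relevant features. The learner does not know `S`, which is what makes identifying it in $\mathrm{poly}(k, \log d)$ interactions the hard part.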
