A fundamental question in reinforcement learning theory is the following: if the optimal value functions are linear in given features, can we learn them efficiently? The counterpart of this problem in supervised learning, linear regression, can be solved efficiently both statistically and computationally. It was therefore quite surprising when recent work \cite{kane2022computational} showed a computational-statistical gap for linear reinforcement learning: even though algorithms with polynomial sample complexity exist, unless NP = RP, no polynomial-time algorithm exists for this setting. In this work, we build on their result to show a computational lower bound for linear reinforcement learning, exponential in the feature dimension and the horizon, under the Randomized Exponential Time Hypothesis. To prove this, we build a round-based game in which, in each round, the learner searches for an unknown vector in a unit hypercube. The rewards in this game are chosen so that if the learner achieves large reward, then the learner's actions can be used to simulate solving a variant of 3-SAT in which (a) each variable appears in a bounded number of clauses, and (b) if an instance is unsatisfiable, then no assignment satisfies more than a $(1-\epsilon)$-fraction of its clauses. We use standard reductions to show that this 3-SAT variant is approximately as hard as 3-SAT. Finally, we also show a lower bound optimized for horizon dependence that almost matches the best known upper bound of $\exp(\sqrt{H})$.
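To make the two properties of the 3-SAT variant concrete, the following is a minimal sketch, not the construction used in the paper. It assumes a signed-integer clause encoding (`+i` for variable `i`, `-i` for its negation), and the helper names `max_occurrences`, `satisfied_fraction`, and `best_fraction` are hypothetical. The brute-force search is only feasible for tiny instances.

```python
from collections import Counter
from itertools import product


def max_occurrences(clauses):
    """Return the largest number of clauses that any single variable appears in."""
    counts = Counter()
    for clause in clauses:
        # Count each variable at most once per clause.
        for var in {abs(lit) for lit in clause}:
            counts[var] += 1
    return max(counts.values(), default=0)


def satisfied_fraction(clauses, assignment):
    """Return the fraction of clauses satisfied by `assignment` (dict: var -> bool)."""
    if not clauses:
        return 1.0
    satisfied = sum(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )
    return satisfied / len(clauses)


def best_fraction(clauses, num_vars):
    """Brute-force the best satisfiable fraction over all 2^num_vars assignments."""
    best = 0.0
    for bits in product([False, True], repeat=num_vars):
        assignment = {i + 1: bit for i, bit in enumerate(bits)}
        best = max(best, satisfied_fraction(clauses, assignment))
    return best


# Illustrative instance (an assumption, not from the paper): all 8 sign
# patterns over variables 1, 2, 3. Every assignment falsifies exactly one clause.
clauses = [
    tuple(sign * var for sign, var in zip(signs, (1, 2, 3)))
    for signs in product((1, -1), repeat=3)
]

print(max_occurrences(clauses))   # bound on occurrences per variable, property (a)
print(best_fraction(clauses, 3))  # best achievable fraction, property (b)
```

In this toy instance each variable appears in 8 clauses and no assignment satisfies more than 7/8 of the clauses, so it is unsatisfiable and meets property (b) with $\epsilon = 1/8$.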