In this paper, we study the offline RL problem with linear function approximation. Our main structural assumption is that the MDP has low inherent Bellman error, which stipulates that linear value functions have linear Bellman backups with respect to the greedy policy. This assumption is natural in that it is essentially the minimal assumption required for value iteration to succeed. We give a computationally efficient algorithm which succeeds under a single-policy coverage condition on the dataset, namely one which outputs a policy whose value is at least that of any policy well-covered by the dataset. Even in the setting where the inherent Bellman error is 0 (termed linear Bellman completeness), our algorithm yields the first known guarantee under single-policy coverage. In the setting of positive inherent Bellman error ${\varepsilon_{\mathrm{BE}}} > 0$, we show that the suboptimality of our algorithm scales with $\sqrt{\varepsilon_{\mathrm{BE}}}$. Furthermore, we prove that this $\sqrt{\varepsilon_{\mathrm{BE}}}$ scaling of the suboptimality cannot be improved by any algorithm. Our lower bound stands in contrast to many other settings in reinforcement learning with misspecification, where one can typically obtain performance that degrades linearly with the misspecification error.
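To make the structural assumption concrete, the following is a minimal sketch of how inherent Bellman error is typically defined in the linear function approximation literature; the feature map $\phi$, parameter sets $\Theta_h$, and Bellman operator $\mathcal{T}_h$ below are notation introduced here for illustration and are not fixed by this abstract. For a finite-horizon MDP with feature map $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ and linear value functions $Q_\theta(s,a) = \langle \phi(s,a), \theta \rangle$,
\[
\varepsilon_{\mathrm{BE}} \;=\; \max_{h} \;\sup_{\theta_{h+1} \in \Theta_{h+1}} \;\inf_{\theta_h \in \Theta_h} \;\sup_{(s,a)} \Bigl| \langle \phi(s,a), \theta_h \rangle - \bigl(\mathcal{T}_h Q_{\theta_{h+1}}\bigr)(s,a) \Bigr|,
\]
where $(\mathcal{T}_h Q)(s,a) = r_h(s,a) + \mathbb{E}_{s' \sim P_h(\cdot \mid s,a)}\bigl[\max_{a'} Q(s',a')\bigr]$ denotes the Bellman backup with respect to the greedy policy. Linear Bellman completeness is the special case $\varepsilon_{\mathrm{BE}} = 0$, i.e., the Bellman backup of every linear value function is itself exactly linear in $\phi$.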