Offline reinforcement learning (RL) aims to find an optimal policy for a Markov decision process (MDP) using a pre-collected dataset. In this work, we revisit the linear programming (LP) reformulation of MDPs for offline RL, with the goal of developing algorithms that attain the optimal $O(1/\sqrt{n})$ sample complexity, where $n$ is the sample size, under partial data coverage and general function approximation, while remaining computationally tractable. To this end, we derive new \emph{error bounds} for both the dual and the primal-dual formulations of the LP, and incorporate them as \emph{constraints} in the LP reformulation. We then show that, under a completeness-type assumption, the $O(1/\sqrt{n})$ sample complexity can be achieved under the standard single-policy coverage assumption, provided that the occupancy-validity constraint in the LP is properly \emph{relaxed}. This framework readily handles both infinite-horizon discounted and average-reward MDPs, in both the general function approximation and the tabular case. Its instantiation to the tabular case achieves either state-of-the-art or the first sample-complexity guarantees for offline RL in these settings. To further remove any completeness-type assumption, we introduce a proper \emph{lower-bound constraint} in the LP, together with a variant of the standard single-policy coverage assumption. The resulting algorithm attains an $O(1/\sqrt{n})$ sample complexity with dependence on the \emph{value-function gap}, under only realizability assumptions. Our properly constrained LP framework advances existing results in several respects, by relaxing certain assumptions and achieving the optimal $O(1/\sqrt{n})$ sample complexity with simple analyses. We hope our results bring new insights into the use of LP formulations, and the equivalent primal-dual minimax optimization, for offline RL through error-bound-induced constraints.
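For readers unfamiliar with the LP reformulation referenced above, the following is a minimal sketch of the standard primal LP over value functions and its dual over occupancy measures for an infinite-horizon discounted MDP; the notation (transition kernel $P$, reward $r$, discount $\gamma$, initial distribution $\rho$) is assumed here for illustration and is not taken from this abstract, and the constrained variants studied in the paper build on this baseline.
\begin{align*}
\text{(Primal)}\quad & \min_{v}\ (1-\gamma)\sum_{s}\rho(s)\,v(s) \\
& \text{s.t.}\quad v(s)\ \ge\ r(s,a) + \gamma\sum_{s'}P(s'\mid s,a)\,v(s') \quad \forall (s,a), \\[4pt]
\text{(Dual)}\quad & \max_{\mu\ \ge\ 0}\ \sum_{s,a}\mu(s,a)\,r(s,a) \\
& \text{s.t.}\quad \sum_{a}\mu(s,a)\ =\ (1-\gamma)\,\rho(s) + \gamma\sum_{s',a'}P(s\mid s',a')\,\mu(s',a') \quad \forall s.
\end{align*}
The equality constraint in the dual is the occupancy-validity (Bellman-flow) constraint whose relaxation is discussed in the abstract.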