Offline reinforcement learning (RL) aims to find an optimal policy for sequential decision-making using a pre-collected dataset, without further interaction with the environment. Recent theoretical progress has focused on developing sample-efficient offline RL algorithms under various relaxed assumptions on data coverage and function approximators, especially for settings with very large state-action spaces. Among them, the framework based on the linear-programming (LP) reformulation of Markov decision processes has shown promise: it enables sample-efficient offline RL with function approximation, under only partial data coverage and realizability assumptions on the function classes, with favorable computational tractability. In this work, we revisit the LP framework for offline RL and provide a new reformulation that advances the existing results in several aspects, relaxing certain assumptions and achieving optimal statistical rates in terms of sample size. Our key enabler is to introduce proper constraints in the reformulation, rather than the regularization used in the literature, together with careful choices of the function classes and initial state distributions. We hope our insights bring to light the use of LP formulations, and the primal-dual minimax optimization they induce, in offline RL.
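For orientation, the LP reformulation referred to above can be sketched in its standard textbook form for a discounted MDP. This is not the new constrained reformulation proposed in this work, and the notation is assumed here for illustration only: discount factor $\gamma$, reward $r$, transition kernel $P$, initial state distribution $\mu_0$, value function $v$, and occupancy measure $\theta$.

\[
\begin{aligned}
\text{(value form)}\quad & \min_{v}\ (1-\gamma)\sum_{s}\mu_0(s)\,v(s)
\quad \text{s.t.}\quad v(s)\ \ge\ r(s,a)+\gamma\sum_{s'}P(s'\mid s,a)\,v(s') \quad \forall (s,a),\\[4pt]
\text{(occupancy form)}\quad & \max_{\theta\ge 0}\ \sum_{s,a}\theta(s,a)\,r(s,a)
\quad \text{s.t.}\quad \sum_{a}\theta(s,a)=(1-\gamma)\,\mu_0(s)+\gamma\sum_{s',a'}P(s\mid s',a')\,\theta(s',a') \quad \forall s,\\[4pt]
\text{(minimax form)}\quad & \min_{v}\ \max_{\theta\ge 0}\ (1-\gamma)\sum_{s}\mu_0(s)\,v(s)+\sum_{s,a}\theta(s,a)\Big(r(s,a)+\gamma\sum_{s'}P(s'\mid s,a)\,v(s')-v(s)\Big).
\end{aligned}
\]

The two programs are LP duals of each other, and the minimax form is their Lagrangian; this is the primal-dual structure the abstract refers to. In the offline setting with function approximation, $v$ and $\theta$ would be restricted to function classes and the sums replaced by dataset averages.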