In offline reinforcement learning (RL), we have no opportunity to explore, so we must assume that the data are sufficient to guide the choice of a good policy. Such assumptions typically take the form of coverage, realizability, Bellman completeness, and/or a hard margin (gap). In this work, we propose value-based algorithms for offline RL with PAC guarantees under only partial coverage, specifically coverage of a single comparator policy, together with realizability of the soft (entropy-regularized) Q-function of that policy and of a related function defined as a saddle point of a certain minimax optimization problem. This yields refined and generally weaker conditions for offline RL. We further show an analogous result for vanilla Q-functions under a soft margin condition. To attain these guarantees, we leverage novel minimax learning algorithms that accurately estimate soft or vanilla Q-functions with $L^2$-convergence guarantees. The loss functions of our algorithms arise from casting the estimation problems as nonlinear convex optimization problems and Lagrangifying them.
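To make the distinction between soft and vanilla Q-functions concrete, the following is a minimal sketch, not the minimax estimator proposed in this work. It assumes a small tabular MDP with a known model and the standard maximum-entropy form of the soft value, $\alpha \log \sum_a \exp(q(s,a)/\alpha)$; the regularizer used in the paper may differ. The names `soft_value`, `q_iteration`, `P`, `R`, `alpha`, and `gamma` are illustrative assumptions.

```python
import numpy as np


def soft_value(q, alpha):
    """Soft (entropy-regularized) value: alpha * log sum_a exp(q(s, a) / alpha).

    Computed with the max subtracted first for numerical stability.
    """
    m = q.max(axis=-1)
    return m + alpha * np.log(np.exp((q - m[..., None]) / alpha).sum(axis=-1))


def q_iteration(P, R, gamma, alpha=None, iters=500):
    """Fixed-point iteration on a known tabular MDP (illustrative only).

    P: transition tensor of shape (S, A, S); R: rewards of shape (S, A).
    alpha=None gives the vanilla Bellman optimality backup (hard max);
    alpha > 0 gives the soft backup with temperature alpha.
    """
    q = np.zeros_like(R)
    for _ in range(iters):
        v = q.max(axis=-1) if alpha is None else soft_value(q, alpha)
        q = R + gamma * P @ v  # (S, A, S) @ (S,) -> (S, A)
    return q


# Hypothetical random MDP with 3 states and 2 actions.
rng = np.random.default_rng(0)
S, A, gamma, alpha = 3, 2, 0.9, 0.01
P = rng.dirichlet(np.ones(S), size=(S, A))  # each P[s, a] sums to 1
R = rng.uniform(size=(S, A))

q_hard = q_iteration(P, R, gamma)
q_soft = q_iteration(P, R, gamma, alpha=alpha)
```

Because the soft value lies between $\max_a q(s,a)$ and $\max_a q(s,a) + \alpha \log |A|$, the soft Q-function in this sketch exceeds the vanilla one by at most $\gamma \alpha \log |A| / (1 - \gamma)$, so the two coincide as $\alpha \to 0$.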