Offline reinforcement learning (RL), where the agent aims to learn the optimal policy from data collected by a behavior policy, has attracted increasing attention in recent years. While offline RL with linear function approximation has been extensively studied, with optimal results achieved under certain assumptions, many works have shifted their attention to offline RL with non-linear function approximation. However, few works on offline RL with non-linear function approximation provide instance-dependent regret guarantees. In this paper, we propose an oracle-efficient algorithm, dubbed Pessimistic Nonlinear Least-Square Value Iteration (PNLSVI), for offline RL with non-linear function approximation. Our algorithmic design comprises three innovative components: (1) a variance-based weighted regression scheme that can be applied to a wide range of function classes, (2) a subroutine for variance estimation, and (3) a planning phase that utilizes a pessimistic value iteration approach. Our algorithm enjoys a regret bound that has a tight dependence on the function class complexity and achieves minimax optimal instance-dependent regret when specialized to linear function approximation. Our work extends previous instance-dependent results from simpler function classes, such as linear and differentiable function classes, to a more general framework.
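To make the three components above concrete, the following is a minimal sketch in the linear special case only. It is not the PNLSVI algorithm itself: the function names (`weighted_ridge`, `pessimistic_vi`), the finite state and action spaces, the elliptical penalty with scale `beta`, the variance floor of 1, and the reuse of the same data for variance estimation and planning are all illustrative assumptions that do not come from the abstract.

```python
import numpy as np

def weighted_ridge(Phi, y, weights, lam):
    """Solve min_theta sum_i weights_i * (Phi_i @ theta - y_i)^2 + lam * ||theta||^2."""
    A = (Phi * weights[:, None]).T @ Phi + lam * np.eye(Phi.shape[1])
    theta = np.linalg.solve(A, (Phi * weights[:, None]).T @ y)
    return theta, A

def pessimistic_vi(dataset, phi, H, beta=1.0, lam=1.0):
    """Hypothetical linear-case sketch.

    dataset[h] = (s, a, r, s_next): arrays of logged transitions at step h.
    phi: feature map of shape (S, A, d). beta: assumed penalty scale.
    """
    S, A, d = phi.shape
    flat = phi.reshape(S * A, d)          # features of every (state, action) pair
    V_next = np.zeros(S)                  # value after the final step is zero
    policy = np.zeros((H, S), dtype=int)
    values = np.zeros((H, S))
    for h in reversed(range(H)):
        s, a, r, s_next = dataset[h]
        Phi = phi[s, a]                   # (n, d) features of logged pairs
        ones = np.ones(len(s))
        # (2) Variance estimation: regress the first and second moments of
        # V_next(s') on the features, then floor the estimate at 1 (assumed).
        m1, _ = weighted_ridge(Phi, V_next[s_next], ones, lam)
        m2, _ = weighted_ridge(Phi, V_next[s_next] ** 2, ones, lam)
        var = np.clip(Phi @ m2 - (Phi @ m1) ** 2, 1.0, None)
        # (1) Variance-weighted regression of the Bellman target.
        theta, Lam = weighted_ridge(Phi, r + V_next[s_next], 1.0 / var, lam)
        # (3) Pessimism: subtract an uncertainty penalty before acting greedily.
        Lam_inv = np.linalg.inv(Lam)
        penalty = beta * np.sqrt(np.einsum("id,de,ie->i", flat, Lam_inv, flat))
        Q = np.clip(flat @ theta - penalty, 0.0, H - h).reshape(S, A)
        policy[h] = Q.argmax(axis=1)
        V_next = Q.max(axis=1)
        values[h] = V_next
    return policy, values

# Toy usage with one-hot features (a tabular special case) and random data.
rng = np.random.default_rng(0)
S, A, H, n = 4, 2, 3, 200
phi = np.eye(S * A).reshape(S, A, S * A)
dataset = [(rng.integers(S, size=n), rng.integers(A, size=n),
            rng.random(n), rng.integers(S, size=n)) for _ in range(H)]
policy, values = pessimistic_vi(dataset, phi, H, beta=0.5)
```

In this sketch, transitions with a higher estimated variance receive less weight in the regression, and state-action pairs that are poorly covered by the data receive a larger penalty, so the greedy policy avoids them. Extending this idea to general non-linear function classes would require a different uncertainty quantifier than the elliptical penalty assumed here.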