Offline reinforcement learning (RL) aims to learn an optimal policy from a pre-collected dataset, without further interaction with the environment. Although many algorithms have been proposed for offline RL, minimax optimality has so far been (nearly) established only for tabular Markov decision processes (MDPs). In this paper, we focus on offline RL with linear function approximation and propose a new pessimism-based algorithm for offline linear MDPs. At the core of our algorithm is an uncertainty decomposition via a reference function, which is new to the literature on offline RL under linear function approximation. Our theoretical analysis shows that the algorithm matches the performance lower bound up to logarithmic factors. We also extend our techniques to two-player zero-sum Markov games (MGs) and establish a new performance lower bound for MGs, which tightens the existing result and verifies the near-minimax optimality of the proposed algorithm. To the best of our knowledge, these are the first computationally efficient and nearly minimax optimal algorithms for offline single-agent MDPs and MGs with linear function approximation.