Offline reinforcement learning (RL) aims to learn a policy that maximizes the expected cumulative reward using a pre-collected dataset. Offline RL with low-rank MDPs or general function approximation has been widely studied recently, but existing algorithms that achieve sample complexity $O(\epsilon^{-2})$ for finding an $\epsilon$-optimal policy either require a uniform data coverage assumption or are computationally inefficient. In this paper, we propose a primal-dual algorithm for offline RL with low-rank MDPs in the discounted infinite-horizon setting. Our algorithm is the first computationally efficient algorithm in this setting that achieves sample complexity $O(\epsilon^{-2})$ under a partial data coverage assumption. This improves upon a recent work that requires $O(\epsilon^{-4})$ samples. Moreover, our algorithm extends the previous work to the offline constrained RL setting by supporting constraints on additional reward signals.