We study offline reinforcement learning (RL) with linear MDPs in the infinite-horizon discounted setting, which aims to learn a policy that maximizes the expected discounted cumulative reward from a pre-collected dataset. Existing algorithms for this setting either require a uniform data coverage assumption or are computationally inefficient when finding an $\epsilon$-optimal policy with $O(\epsilon^{-2})$ sample complexity. In this paper, we propose a primal-dual algorithm for offline RL with linear MDPs in the infinite-horizon discounted setting. Our algorithm is the first computationally efficient algorithm in this setting to achieve a sample complexity of $O(\epsilon^{-2})$ under a partial data coverage assumption, improving upon a recent work that requires $O(\epsilon^{-4})$ samples. Moreover, we extend our algorithm to the offline constrained RL setting, which enforces constraints on additional reward signals.
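For concreteness, the learning objective referred to above is the standard infinite-horizon discounted return; the following is a minimal statement of it under assumed notation (discount factor $\gamma$, reward $r$, and optimal policy $\pi^{*}$ are not fixed in the abstract itself):
\[
  J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right],
  \qquad \gamma \in [0,1),
\]
% An \epsilon-optimal policy \hat{\pi} is one satisfying J(\pi^{*}) - J(\hat{\pi}) \le \epsilon;
% the sample complexities quoted above count offline transitions needed to guarantee this.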