We seek to understand what facilitates sample-efficient learning from historical datasets for sequential decision-making, a problem popularly known as offline reinforcement learning (RL). Further, we are interested in algorithms that remain sample-efficient while leveraging (value) function approximation. In this paper, we address these fundamental questions by (i) proposing a notion of data diversity that subsumes previous notions of coverage in offline RL and (ii) using this notion to unify three distinct classes of offline RL algorithms based on version spaces (VS), regularized optimization (RO), and posterior sampling (PS). We establish that VS-based, RO-based, and PS-based algorithms achieve comparable sample efficiency under standard assumptions, recovering the state-of-the-art sub-optimality bounds for finite and linear model classes. This result is surprising: prior work suggested that RO-based algorithms have unfavorable sample complexity compared to VS-based algorithms, and posterior sampling is rarely considered in offline RL because of its explorative nature. Notably, our proposed model-free PS-based algorithm for offline RL is novel, with sub-optimality bounds that are frequentist (i.e., worst-case) in nature.