We seek to understand what facilitates sample-efficient learning from historical datasets for sequential decision-making, a problem popularly known as offline reinforcement learning (RL). We are further interested in algorithms that remain sample-efficient while leveraging (value) function approximation. In this paper, we address these fundamental questions by (i) proposing a notion of data diversity that subsumes previous coverage measures in offline RL and (ii) using this notion to unify three distinct classes of offline RL algorithms, based on version spaces (VS), regularized optimization (RO), and posterior sampling (PS). We establish that VS-based, RO-based, and PS-based algorithms achieve \emph{comparable} sample efficiency under standard assumptions, recovering the state-of-the-art sub-optimality bounds for finite and linear model classes. This result is surprising: prior work suggested that RO-based algorithms have unfavorable sample complexity compared to VS-based algorithms, and posterior sampling is rarely considered in offline RL because of its explorative nature. Notably, our proposed model-free PS-based algorithm for offline RL is novel, with sub-optimality bounds that are frequentist (i.e., worst-case) in nature.