Bridging Offline Reinforcement Learning and Imitation Learning: A Tale of Pessimism

Offline (or batch) reinforcement learning (RL) algorithms seek to learn an optimal policy from a fixed dataset without active data collection. Depending on the composition of the offline dataset, two main categories of methods are used: imitation learning, which is suitable for expert datasets, and vanilla offline RL, which often requires uniform coverage datasets. In practice, datasets often deviate from these two extremes, and the exact data composition is usually unknown a priori. To bridge this gap, we present a new offline RL framework that smoothly interpolates between the two extremes of data composition, hence unifying imitation learning and vanilla offline RL. The new framework is centered around a weak version of the concentrability coefficient that measures the deviation from the behavior policy to the expert policy alone. Under this new framework, we further investigate a question of algorithm design: can one develop an algorithm that achieves a minimax optimal rate and also adapts to unknown data composition? To address this question, we consider a lower confidence bound (LCB) algorithm developed based on pessimism in the face of uncertainty in offline RL. We study finite-sample properties of LCB as well as information-theoretic limits in multi-armed bandits, contextual bandits, and Markov decision processes (MDPs). Our analysis reveals surprising facts about optimality rates. In particular, in all three settings, LCB achieves a faster rate of $1/N$ for nearly-expert datasets, compared to the usual rate of $1/\sqrt{N}$ in offline RL, where $N$ is the number of samples in the batch dataset. In the case of contextual bandits with at least two contexts, we prove that LCB is adaptively optimal for the entire data composition range, achieving a smooth transition from imitation learning to offline RL. We further show that LCB is almost adaptively optimal in MDPs.
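To make the pessimism principle concrete, the following is a minimal sketch of an LCB rule in the multi-armed bandit setting. It is an illustration under stated assumptions, not necessarily the exact algorithm analyzed here: rewards are assumed to lie in $[0, 1]$, the penalty is a Hoeffding-style term, arms never seen in the dataset are given the lowest possible score, and the function name `lcb_best_arm` and the parameter `delta` are hypothetical.

```python
import math


def lcb_best_arm(actions, rewards, num_arms, delta=0.05):
    """Return the arm with the largest lower confidence bound (LCB).

    Assumptions (not taken from the source): rewards lie in [0, 1] and the
    penalty is the Hoeffding-style term sqrt(log(2 * num_arms / delta) / (2 * n)).
    """
    counts = [0] * num_arms
    sums = [0.0] * num_arms
    for a, r in zip(actions, rewards):
        counts[a] += 1
        sums[a] += r

    best_arm, best_lcb = 0, -math.inf
    for a in range(num_arms):
        if counts[a] == 0:
            # Unobserved arm: be maximally pessimistic (assumed convention).
            lcb = -1.0
        else:
            mean = sums[a] / counts[a]
            penalty = math.sqrt(math.log(2 * num_arms / delta) / (2 * counts[a]))
            lcb = mean - penalty
        if lcb > best_lcb:
            best_arm, best_lcb = a, lcb
    return best_arm


# Toy dataset: arm 0 is well covered, arm 1 was pulled once and got lucky.
actions = [0] * 100 + [1]
rewards = [0.6] * 100 + [1.0]
chosen = lcb_best_arm(actions, rewards, num_arms=2)
print(chosen)
```

In this toy dataset the empirical mean favors arm 1, which has a single lucky sample, while the penalized score favors the well-covered arm 0. Discounting poorly covered actions in this way is what lets a pessimistic rule stay close to the behavior policy when the data are nearly expert.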
