Sample-efficient offline reinforcement learning (RL) with linear function approximation has recently been studied extensively. Much of the prior work yields the minimax-optimal bound of $\tilde{\mathcal{O}}(\frac{1}{\sqrt{K}})$, where $K$ is the number of episodes in the offline data. In this work, we seek to understand instance-dependent bounds for offline RL with function approximation. We present an algorithm called Bootstrapped and Constrained Pessimistic Value Iteration (BCP-VI), which leverages data bootstrapping and constrained optimization on top of pessimism. We show that, under a partial data coverage assumption, namely \emph{concentrability} with respect to an optimal policy, the proposed algorithm yields a fast rate of $\tilde{\mathcal{O}}(\frac{1}{K})$ for offline RL when there is a positive gap in the optimal Q-value functions, even when the offline data were adaptively collected. Moreover, when the linear features of the optimal actions in the states reachable by an optimal policy span those reachable by the behavior policy and the optimal actions are unique, offline RL achieves absolute zero sub-optimality error once $K$ exceeds a (finite) instance-dependent threshold. To the best of our knowledge, these are, respectively, the first $\tilde{\mathcal{O}}(\frac{1}{K})$ bound and the first absolute zero sub-optimality bound for offline RL with linear function approximation from adaptive data with partial coverage. We also provide instance-agnostic and instance-dependent information-theoretic lower bounds to complement our upper bounds.
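
For concreteness, the quantities behind these rates can be written out in standard episodic-MDP notation. The symbols used here ($s_1$, $V_h^{*}$, $Q_h^{*}$, $\Delta_h$, $\Delta_{\min}$, $K_0$, $\hat{\pi}$) are assumptions for illustration and are not fixed by the abstract; the paper's own definitions may differ, for example by averaging over an initial-state distribution instead of fixing $s_1$.

$$
\begin{aligned}
\mathrm{SubOpt}(\hat{\pi}) &= V_1^{*}(s_1) - V_1^{\hat{\pi}}(s_1), \\
\Delta_h(s,a) &= V_h^{*}(s) - Q_h^{*}(s,a), \qquad
\Delta_{\min} = \min_{h,s,a}\bigl\{\Delta_h(s,a) : \Delta_h(s,a) > 0\bigr\}, \\
\mathrm{SubOpt}(\hat{\pi}) &=
\begin{cases}
\tilde{\mathcal{O}}\bigl(\tfrac{1}{\sqrt{K}}\bigr) & \text{in the minimax (instance-agnostic) sense}, \\
\tilde{\mathcal{O}}\bigl(\tfrac{1}{K}\bigr) & \text{if } \Delta_{\min} > 0, \\
0 & \text{if } K \ge K_0 \text{ and the spanning and uniqueness conditions hold.}
\end{cases}
\end{aligned}
$$

Here $\hat{\pi}$ stands for the policy learned from the $K$ offline episodes and $K_0$ for the finite instance-dependent threshold. Read this way, a positive gap is what separates the $\frac{1}{\sqrt{K}}$ regime from the $\frac{1}{K}$ regime.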