We study offline reinforcement learning under $Q^\star$-approximation and partial coverage, a setting that motivates practical algorithms such as Conservative $Q$-Learning (CQL; Kumar et al., 2020) but has received limited theoretical attention. Our work is motivated by the following open question: "Are $Q^\star$-realizability and Bellman completeness sufficient for sample-efficient offline RL under partial coverage?" We answer in the negative via an information-theoretic lower bound. To identify additional structure that enables sample-efficient offline RL under partial coverage, we introduce a general decision-estimation framework, inspired by the model-free decision-estimation coefficient (DEC) for online RL (Foster et al., 2023b; Liu et al., 2025b). The framework decomposes the complexity of offline RL into a decision complexity term and a value estimation error term, so that the two sub-problems can be studied modularly. Our results not only unify existing ones (Chen and Jiang, 2022; Uehara et al., 2023) but also improve and generalize them. On the decision complexity side, our improvements include the first $\epsilon^{-2}$ sample complexity bound for soft $Q$-learning under partial coverage, improving on the $\epsilon^{-4}$ bound of Uehara et al. (2023); the removal of the need for additional online interaction in the value-gap setting of Chen and Jiang (2022); and new learnable settings beyond these two cases. On the value estimation side, we provide a new characterization of the role of Bellman completeness under partial coverage, and the first characterization of offline learnability for general low-Bellman-rank MDPs (Jiang et al., 2017; Du et al., 2021; Jin et al., 2021), a canonical online RL setting that has remained unexplored in offline RL except for special cases. As a side contribution, our techniques give the first analysis of CQL in the function approximation setting.