On the Complexity of Offline Reinforcement Learning with $Q^\star$-Approximation and Partial Coverage

We study offline reinforcement learning under $Q^\star$-approximation and partial coverage, a setting that motivates practical algorithms such as Conservative $Q$-Learning (CQL; Kumar et al., 2020) but has received limited theoretical attention. Our work is inspired by the following open question: "Are $Q^\star$-realizability and Bellman completeness sufficient for sample-efficient offline RL under partial coverage?" We answer in the negative by establishing an information-theoretic lower bound. To identify additional structure that enables sample-efficient offline RL under partial coverage, we introduce a general decision-estimation framework, inspired by model-free decision-estimation coefficients (DEC) for online RL (Foster et al., 2023b; Liu et al., 2025b). Our framework decomposes the complexity of offline RL into decision complexity and value estimation error, which allows the two sub-problems to be studied modularly. Our results not only unify existing results (Chen and Jiang, 2022; Uehara et al., 2023) but also improve and generalize them. On the decision complexity side, our improvements include: the first $\epsilon^{-2}$ sample complexity bound for soft $Q$-learning under partial coverage, improving on the $\epsilon^{-4}$ bound of Uehara et al. (2023); the removal of the need for additional online interaction in the value-gap setting of Chen and Jiang (2022); and new learnable settings beyond these two cases. On the value estimation side, we provide a new characterization of the role of Bellman completeness under partial coverage, and the first characterization of offline learnability for general low-Bellman-rank MDPs (Jiang et al., 2017; Du et al., 2021; Jin et al., 2021), a canonical online RL setting that has remained unexplored in offline RL except for special cases. As a side contribution, our techniques give the first analysis of CQL in the function approximation setting.
