We study offline reinforcement learning under $Q^\star$-approximation and partial coverage, a setting that motivates practical algorithms such as Conservative $Q$-Learning (CQL; Kumar et al., 2020) but has received limited theoretical attention. Our work is inspired by the following open question: "Are $Q^\star$-realizability and Bellman completeness sufficient for sample-efficient offline RL under partial coverage?" We answer in the negative by establishing an information-theoretic lower bound. Going substantially beyond this, and drawing on model-free decision-estimation coefficients (DEC) for online RL (Foster et al., 2023b; Liu et al., 2025b), we introduce a general framework that characterizes the intrinsic complexity of a given $Q^\star$ function class. This complexity measure recovers and improves the quantities underlying the guarantees of Chen and Jiang (2022) and Uehara et al. (2023), and extends to broader settings. Our decision-estimation decomposition can be combined with a wide range of $Q^\star$ estimation procedures, modularizing and generalizing existing approaches. Beyond the general framework, we make several further contributions. By developing a novel second-order performance difference lemma, we obtain the first $\epsilon^{-2}$ sample complexity under partial coverage for soft $Q$-learning, improving the $\epsilon^{-4}$ bound of Uehara et al. (2023). We remove the need in Chen and Jiang (2022) for additional online interaction when the value gap of $Q^\star$ is unknown. We also give the first characterization of offline learnability for general low-Bellman-rank MDPs without Bellman completeness (Jiang et al., 2017; Du et al., 2021; Jin et al., 2021), a canonical setting in online RL that remains unexplored in offline RL outside of special cases. Finally, we provide the first analysis of CQL under $Q^\star$-realizability and Bellman completeness beyond the tabular case.
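For context on the second-order result above, the following is the classical first-order performance difference lemma for a discounted MDP (Kakade and Langford, 2002), not the paper's new second-order variant; the notation ($d^{\pi}$, $A^{\pi'}$) is standard rather than taken from this paper:
\[
  % Classical (first-order) performance difference lemma. Here d^{\pi}
  % denotes the discounted state-occupancy measure of \pi, and
  % A^{\pi'}(s,a) = Q^{\pi'}(s,a) - V^{\pi'}(s) is the advantage of \pi'.
  J(\pi) - J(\pi')
  \;=\;
  \frac{1}{1-\gamma}\,
  \mathbb{E}_{s \sim d^{\pi}}\,
  \mathbb{E}_{a \sim \pi(\cdot \mid s)}
  \bigl[ A^{\pi'}(s,a) \bigr].
\]
Bounding suboptimality through this identity is first order in the advantage estimation error; a second-order refinement of the right-hand side is what enables the improvement from $\epsilon^{-4}$ to $\epsilon^{-2}$ sample complexity claimed above.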