We consider offline reinforcement learning (RL) in $H$-horizon Markov decision processes (MDPs) under the linear $q^\pi$-realizability assumption, where the action-value function of every policy is linear with respect to a given $d$-dimensional feature function. The hope in this setting is that a good policy can be learned without requiring a sample size that scales with the number of states in the MDP. Foster et al. [2021] have shown this to be impossible even under $\textit{concentrability}$, a data coverage assumption where a coefficient $C_\text{conc}$ bounds the extent to which the state-action distribution of any policy can veer off the data distribution. However, the data in that work took the form of a sequence of individual transitions. This leaves open the question of whether the negative result could be overcome if the data were instead composed of full trajectories. In this work we answer this question positively by proving that with trajectory data, a dataset of size $\text{poly}(d,H,C_\text{conc})/\epsilon^2$ is sufficient for deriving an $\epsilon$-optimal policy, regardless of the size of the state space. The main tool that makes this result possible is due to Weisz et al. [2023], who demonstrate that linear MDPs can be used to approximate linearly $q^\pi$-realizable MDPs. The connection to trajectory data is that the linear MDP approximation relies on "skipping" over certain states. The associated estimation problems are thus easy when working with trajectory data, while they remain nontrivial when working with individual transitions. The question of computational efficiency under our assumptions remains open.
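For concreteness, the two assumptions named above are commonly formalized roughly as follows. This is a sketch in standard notation, not necessarily the exact definitions used in the paper: the stage-indexed feature map $\phi_h$, the parameter vector $\theta_h^\pi$, the occupancy measure $d_h^\pi$ of policy $\pi$ at stage $h$, and the data distribution $\mu_h$ are symbols assumed here for illustration.

$$
q_h^\pi(s,a) = \langle \phi_h(s,a), \theta_h^\pi \rangle \quad \text{for some } \theta_h^\pi \in \mathbb{R}^d, \text{ for every policy } \pi \text{ and all } h, s, a,
$$

$$
\sup_{\pi,\, h,\, s,\, a} \frac{d_h^\pi(s,a)}{\mu_h(s,a)} \le C_\text{conc}.
$$

Under this reading, the first condition says that every policy's action-value function lies in the span of the $d$ features, and the second says that no policy visits any state-action pair more than $C_\text{conc}$ times as often as the data distribution does.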