A fundamental challenge in offline reinforcement learning is distributional shift, and scarce data or datasets dominated by out-of-distribution (OOD) regions exacerbate it. Our theoretical analysis and experiments show that the standard squared-error objective induces a harmful temporal-difference (TD) cross-covariance. This effect is amplified in OOD regions, biasing optimization and degrading policy learning. To counteract this mechanism, we develop two complementary strategies. The first, Clustered Cross-Covariance Control for TD (C^4), is a partitioned buffer sampling scheme that restricts each update to a localized replay partition, attenuating irregular covariance effects and aligning update directions while remaining easy to integrate with existing implementations. The second is an explicit gradient-based corrective penalty that cancels the covariance-induced bias within each update. We prove that buffer partitioning preserves the lower-bound property of the maximization objective, and that these constraints mitigate excessive conservatism in extreme OOD regions without altering the core behavior of policy-constrained offline reinforcement learning. Empirically, our method is more stable and improves returns by up to 30% over prior methods, especially on small datasets and on splits that emphasize OOD regions.
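To make the partitioned buffer sampling idea concrete, the following is a minimal sketch assuming partitions are obtained by k-means clustering over state features and each minibatch is drawn from a single partition. The class name ClusteredReplayBuffer, the clustering choice, and all parameters are illustrative assumptions, not the paper's prescribed procedure.

```python
# Minimal sketch of partitioned (clustered) buffer sampling.
# Assumption: partitions come from k-means on state features; the paper's
# actual partitioning rule may differ.
import numpy as np
from sklearn.cluster import KMeans

class ClusteredReplayBuffer:
    def __init__(self, states, actions, rewards, next_states, dones,
                 n_clusters=16, seed=0):
        self.data = (states, actions, rewards, next_states, dones)
        # Partition the offline dataset once by clustering on state features.
        labels = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(states)
        self.partitions = [np.where(labels == k)[0] for k in range(n_clusters)]
        self.rng = np.random.default_rng(seed)

    def sample(self, batch_size):
        # Draw the whole minibatch from one partition, so every TD update is
        # computed on locally similar transitions. This localization is the
        # mechanism credited above with attenuating irregular covariance
        # effects and aligning update directions.
        part = self.partitions[self.rng.integers(len(self.partitions))]
        idx = self.rng.choice(part, size=batch_size, replace=len(part) < batch_size)
        return tuple(arr[idx] for arr in self.data)
```

Because the buffer only changes how minibatches are drawn, it can replace a uniform replay buffer in an existing offline RL implementation without touching the loss or the networks.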
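The corrective penalty can likewise be sketched. One plausible reading, assuming the bias enters through the within-batch cross-covariance between Q predictions and bootstrapped TD targets, is to add that empirical covariance back into the squared-error loss so its contribution to the gradient cancels. The function name and the exact penalty form below are assumptions, not the paper's definition.

```python
# Hypothetical sketch of a covariance-corrective TD loss in PyTorch.
# Assumption: the harmful term is the within-batch cross-covariance between
# predictions and bootstrapped targets; the paper's penalty may differ.
import torch

def covariance_corrected_td_loss(q_pred: torch.Tensor,
                                 q_target: torch.Tensor,
                                 penalty_weight: float = 1.0) -> torch.Tensor:
    y = q_target.detach()                    # no gradient through the target
    td_loss = torch.mean((q_pred - y) ** 2)  # standard squared-error objective
    # Empirical cross-covariance between Q predictions and TD targets.
    cov = torch.mean((q_pred - q_pred.mean()) * (y - y.mean()))
    # Expanding the squared error yields a -2*Cov(Q, y) cross term; adding
    # 2*cov back cancels that term's contribution to the gradient.
    return td_loss + penalty_weight * 2.0 * cov
```

With penalty_weight = 0 this reduces to the standard squared TD error, so the correction can be ablated or annealed independently of the partitioned sampling scheme.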