Offline reinforcement learning (RL) leverages massive datasets to solve sequential decision problems. Most existing work discusses only how to defend against out-of-distribution (OOD) actions, whereas we investigate a broader issue: false correlations between epistemic uncertainty and decision-making, an essential factor that causes suboptimality. In this paper, we propose falSe COrrelation REduction (SCORE) for offline RL, a practically effective and theoretically provable algorithm. Empirically, SCORE achieves state-of-the-art (SoTA) performance with a 3.1x acceleration on various tasks in a standard benchmark (D4RL). The algorithm introduces an annealing behavior cloning regularizer that helps produce a high-quality uncertainty estimate, which is critical for eliminating the false correlations that lead to suboptimality. Theoretically, we justify the proposed method and prove that it converges to the optimal policy at a sublinear rate under mild assumptions.
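To make the idea of an annealing behavior cloning regularizer combined with an uncertainty penalty more concrete, the following is a minimal sketch and not the SCORE implementation. The abstract gives no formulas, so everything here is an assumption made for illustration: the geometric decay schedule, the use of an ensemble standard deviation as the uncertainty estimate, the squared-error cloning term, and all names and default values (`annealed_bc_weight`, `pessimistic_value`, `policy_loss`, `beta`, `decay`, `interval`).

```python
import statistics


def annealed_bc_weight(step, initial_weight=1.0, decay=0.5, interval=10_000):
    """Hypothetical schedule: multiply the cloning weight by `decay` every `interval` steps."""
    return initial_weight * decay ** (step // interval)


def pessimistic_value(q_estimates, beta=1.0):
    """Assumed uncertainty penalty: ensemble mean minus beta times ensemble std."""
    mean = statistics.fmean(q_estimates)
    std = statistics.pstdev(q_estimates)  # disagreement among ensemble members
    return mean - beta * std


def policy_loss(q_estimates, policy_action, dataset_action, step, beta=1.0):
    """Loss to minimize: negative pessimistic value plus the annealed cloning term."""
    # Squared distance between the policy's action and the logged dataset action.
    bc_term = sum((p - d) ** 2 for p, d in zip(policy_action, dataset_action))
    return -pessimistic_value(q_estimates, beta) + annealed_bc_weight(step) * bc_term
```

In this sketch the cloning term dominates early in training, keeping the policy close to the dataset while the value ensemble is still unreliable, and its influence shrinks as the weight decays, leaving the uncertainty-penalized value to drive later updates.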