Offline reinforcement learning (RL) leverages large datasets to solve sequential decision problems. Most existing work focuses only on guarding against out-of-distribution (OOD) actions, whereas we investigate a broader issue: spurious correlations between epistemic uncertainty and decision-making, an essential factor that causes suboptimality. In this paper, we propose Spurious COrrelation REduction (SCORE) for offline RL, a practically effective and theoretically provable algorithm. Empirically, SCORE achieves state-of-the-art (SoTA) performance with a 3.1x speedup on various tasks in the standard D4RL benchmark. The algorithm introduces an annealing behavior cloning regularizer that helps produce high-quality uncertainty estimates, which are critical for eliminating the spurious correlations that cause suboptimality. Theoretically, we justify the proposed method and prove that it converges to the optimal policy at a sublinear rate under mild assumptions.