One of the fundamental challenges for offline reinforcement learning (RL) is ensuring robustness to the data distribution. Whether or not the data originate from a near-optimal policy, an algorithm should be able to learn an effective control policy that aligns with the distribution of the offline data. Unfortunately, behavior regularization, a simple yet effective offline RL algorithm, tends to struggle in this regard. In this paper, we propose a new algorithm that substantially enhances behavior regularization based on conservative policy iteration. Our key observation is that by iteratively refining the reference policy used for behavior regularization, the conservative policy update guarantees gradual improvement, while also implicitly avoiding queries to out-of-sample actions and thereby preventing catastrophic learning failures. We prove that in the tabular setting this algorithm is capable of learning the optimal policy covered by the offline dataset, commonly referred to as the in-sample optimal policy. We then explore several implementation details of the algorithm when function approximation is applied. The resulting algorithm is easy to implement, requiring only a few lines of code modification to existing methods. Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines on most tasks, clearly demonstrating its superiority over behavior regularization.
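The tabular claim above can be illustrated with a small sketch. It is not the paper's implementation: it assumes a KL-regularized improvement step with the closed form pi_new(a|s) ∝ pi_ref(a|s) · exp(Q(s,a)/tau), which the abstract does not specify, and it resets the reference policy to the current policy after every round. The toy MDP, the function names `evaluate_q` and `refine_policy`, and the values of `gamma`, `tau`, and `n_rounds` are all hypothetical.

```python
import numpy as np

def evaluate_q(P, R, pi, gamma, n_iters=200):
    """Iterative policy evaluation for a tabular MDP.

    P has shape (S, A, S), R and pi have shape (S, A).
    """
    Q = np.zeros_like(R)
    for _ in range(n_iters):
        V = (pi * Q).sum(axis=1)      # V(s) = sum_a pi(a|s) Q(s, a)
        Q = R + gamma * (P @ V)       # Bellman backup under pi
    return Q

def refine_policy(P, R, mu, gamma=0.9, tau=1.0, n_rounds=50):
    """Assumed KL-regularized update with an iteratively refined reference.

    Starts from the behavior policy mu and, each round, uses the current
    policy as the reference: pi_new ∝ pi * exp(Q / tau). Actions with
    mu(a|s) = 0 keep probability zero because the update is multiplicative.
    """
    pi = mu.copy()
    for _ in range(n_rounds):
        Q = evaluate_q(P, R, pi, gamma)
        # Mask out-of-sample actions so their Q-values are never used.
        masked = np.where(mu > 0, Q, -np.inf)
        logits = (masked - masked.max(axis=1, keepdims=True)) / tau
        unnorm = pi * np.exp(logits)
        pi = unnorm / unnorm.sum(axis=1, keepdims=True)
    return pi

# Hypothetical toy MDP: 2 states, 3 actions, every action is a self-loop.
n_states, n_actions = 2, 3
P = np.zeros((n_states, n_actions, n_states))
for s in range(n_states):
    P[s, :, s] = 1.0

# Action 2 has the highest reward in both states but never appears in the data.
R = np.array([[1.0, 0.5, 2.0],
              [0.0, 1.0, 3.0]])
mu = np.array([[0.5, 0.5, 0.0],
               [0.5, 0.5, 0.0]])

pi_final = refine_policy(P, R, mu)
print(np.round(pi_final, 3))
```

In this toy problem the refined policy concentrates on the best action that the behavior policy covers in each state (action 0 in state 0, action 1 in state 1) and never assigns mass to the uncovered action 2, which matches the notion of an in-sample optimal policy described in the abstract.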