Offline reinforcement learning (RL) seeks to derive an effective control policy from previously collected data. To circumvent errors caused by inadequate data coverage, behavior-regularized methods optimize the control policy while minimizing its deviation from the data collection policy. Nevertheless, these methods often perform poorly in practice, particularly when the offline dataset is collected by sub-optimal policies. In this paper, we propose a novel algorithm employing in-sample policy iteration that substantially enhances behavior-regularized methods in offline RL. The core insight is that, by continuously refining the policy used for behavior regularization, in-sample policy iteration gradually improves itself while implicitly avoiding queries to out-of-sample actions, thereby averting catastrophic learning failures. Our theoretical analysis verifies its ability to learn the in-sample optimal policy, exclusively utilizing actions well-covered by the dataset. Moreover, we propose competitive policy improvement, a technique applying two competitive policies, both of which are trained by iteratively improving over the best competitor. We show that this simple yet potent technique significantly enhances learning efficiency when function approximation is applied. Lastly, experimental results on the D4RL benchmark indicate that our algorithm outperforms previous state-of-the-art methods on most tasks.
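To make the core insight concrete, the following is a minimal tabular sketch and not the algorithm proposed in the paper. It assumes a known toy MDP, exact policy evaluation, and a KL-regularized improvement step in which the reference policy is simply the previous iterate; the function names, the temperature `tau`, and the toy problem are all hypothetical. Because each update reweights the current policy multiplicatively, any action that has zero probability under the behavior policy keeps zero probability, so its (possibly erroneous) value estimate never influences learning.

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma):
    """Exact Q^pi for a tabular MDP. Shapes: P (S, A, S), R (S, A), pi (S, A)."""
    n_states = R.shape[0]
    P_pi = np.einsum("sat,sa->st", P, pi)   # state-to-state transitions under pi
    r_pi = (pi * R).sum(axis=1)             # expected one-step reward under pi
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    return R + gamma * P @ V                # Q(s, a) = R(s, a) + gamma * E[V(s')]

def regularized_improvement(Q, ref, tau):
    """pi_new(a|s) proportional to ref(a|s) * exp(Q(s, a) / tau).

    Actions outside the support of `ref` are masked out, so their Q-values
    are never used (the "in-sample" property in this sketch).
    """
    logits = np.where(ref > 0, Q / tau, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = ref * np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def in_sample_policy_iteration(P, R, behavior, gamma=0.9, tau=0.5, iters=50):
    """Start from the behavior policy and keep regularizing toward the latest iterate."""
    pi = behavior.copy()
    for _ in range(iters):
        Q = evaluate_policy(P, R, pi, gamma)
        pi = regularized_improvement(Q, pi, tau)  # reference policy = current policy
    return pi

# Hypothetical toy problem: 2 states, 3 actions, deterministic transitions.
P = np.zeros((2, 3, 2))
P[0, 0, 0] = 1.0   # state 0, action 0: stay in state 0
P[0, 1, 1] = 1.0   # state 0, action 1: move to state 1
P[1, 0, 0] = 1.0   # state 1, action 0: move to state 0
P[1, 1, 1] = 1.0   # state 1, action 1: stay in state 1
P[:, 2, 0] = 1.0   # action 2 is never taken in the data

R = np.zeros((2, 3))
R[1, 1] = 1.0      # the only real reward
R[:, 2] = 1e3      # stand-in for a badly overestimated out-of-sample action

# Sub-optimal behavior policy that never takes action 2.
behavior = np.array([[0.9, 0.1, 0.0],
                     [0.8, 0.2, 0.0]])

pi_final = in_sample_policy_iteration(P, R, behavior)
print(np.round(pi_final, 3))
```

In this toy problem the iterates concentrate on action 1 in both states, which is the best choice among the actions the behavior policy actually takes, while the spuriously attractive action 2 is never selected. Anchoring every update to the fixed behavior policy would instead keep the learned policy close to the sub-optimal data collector.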