Offline reinforcement learning (RL) has recently made remarkable progress, with a range of new algorithms and datasets. However, this work usually focuses on algorithmic advances and overlooks the fact that many low-level implementation choices considerably influence, or even drive, final performance. Because these choices are not sufficiently discussed or aligned in the literature, it is hard to attribute progress in offline RL to its actual sources. In addition, papers that focus on one dataset (e.g., D4RL) often ignore algorithms proposed on another (e.g., RL Unplugged), which isolates the algorithms from one another and may slow overall progress. This work therefore aims to bridge the gaps caused by low-level choices and datasets. To this end, we empirically investigate 20 implementation choices using three representative algorithms (CQL, CRR, and IQL) and present a guidebook for choosing implementations. Following the guidebook, we find two variants, CRR+ and CQL+, that achieve new state-of-the-art results on D4RL. Moreover, we benchmark eight popular offline RL algorithms across datasets under a unified training and evaluation framework. The findings are instructive: the success of a learning paradigm depends heavily on the data distribution, and some previous conclusions are biased by the dataset used. Our code is available at https://github.com/sail-sg/offbench.