Offline reinforcement learning algorithms hold the promise of enabling data-driven RL methods that do not require costly or dangerous real-world exploration and benefit from large pre-collected datasets. This in turn can facilitate real-world applications, as well as a more standardized approach to RL research. Furthermore, offline RL methods can provide effective initializations for online fine-tuning to overcome challenges with exploration. However, evaluating progress on offline RL algorithms requires effective and challenging benchmarks that capture properties of real-world tasks, provide a range of task difficulties, and cover a range of challenges both in terms of the parameters of the domain (e.g., length of the horizon, sparsity of rewards) and the parameters of the data (e.g., narrow demonstration data or broad exploratory data). While considerable progress in offline RL in recent years has been enabled by simpler benchmark tasks, the most widely used datasets are increasingly saturating in performance and may fail to reflect properties of realistic tasks. We propose a new benchmark for offline RL that focuses on realistic simulations of robotic manipulation and locomotion environments, based on models of real-world robotic systems, and comprising a variety of data sources, including scripted data, play-style data collected by human teleoperators, and other data sources. Our proposed benchmark covers state-based and image-based domains, and supports both offline RL and online fine-tuning evaluation, with some of the tasks specifically designed to require both pre-training and fine-tuning. We hope that our proposed benchmark will facilitate further progress on both offline RL and fine-tuning algorithms. A website with code, examples, tasks, and data is available at \url{https://sites.google.com/view/d5rl/}