Offline reinforcement learning (RL) aims to learn a policy from a static dataset without further interactions with the environment. Collecting sufficiently large datasets for offline RL is laborious, since it requires a massive number of environment interactions and becomes impractical when interaction with the environment is restricted. Hence, how an agent learns the best policy from a minimal static dataset is a crucial issue in offline RL, analogous to the sample-efficiency problem in online RL. In this paper, we propose a simple yet effective plug-and-play pretraining method that initializes the features of a Q-network to enhance data efficiency in offline RL. Specifically, we introduce a shared Q-network structure that outputs predictions of both the next state and the Q-value. We pretrain the shared Q-network with a supervised regression task that predicts the next state, and then train it with diverse offline RL methods. Through extensive experiments, we empirically demonstrate that our method enhances the performance of existing popular offline RL methods on the D4RL, Robomimic, and V-D4RL benchmarks. Furthermore, we show on the D4RL and ExoRL benchmarks that our method significantly boosts data-efficient offline RL across various data qualities and data distributions. Notably, our method trained with only 10% of the dataset outperforms standard algorithms trained on the full dataset.
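To make the described architecture concrete, the following is a minimal sketch of a shared Q-network with two heads, one for next-state prediction (the supervised pretraining task) and one for the Q-value (the downstream offline RL objective). It assumes a simple MLP encoder over the (state, action) pair; all module names, dimensions, and the `pretrain_step` helper are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class SharedQNetwork(nn.Module):
    """Hypothetical shared Q-network: one encoder, two heads (next state, Q-value)."""

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        # Shared feature encoder over the concatenated (state, action) input.
        self.encoder = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Head 1: next-state prediction, used during supervised pretraining.
        self.next_state_head = nn.Linear(hidden_dim, state_dim)
        # Head 2: Q-value estimate, used by the offline RL algorithm afterwards.
        self.q_head = nn.Linear(hidden_dim, 1)

    def forward(self, state: torch.Tensor, action: torch.Tensor):
        feature = self.encoder(torch.cat([state, action], dim=-1))
        return self.next_state_head(feature), self.q_head(feature)


def pretrain_step(net: SharedQNetwork, optimizer: torch.optim.Optimizer,
                  state: torch.Tensor, action: torch.Tensor,
                  next_state: torch.Tensor) -> float:
    """One supervised regression step: predict s' from (s, a) in the static dataset."""
    pred_next_state, _ = net(state, action)
    loss = nn.functional.mse_loss(pred_next_state, next_state)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

After pretraining, the encoder weights would be kept and the Q-value head trained with whichever offline RL method is plugged in, which is what makes the scheme plug-and-play.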