Offline reinforcement learning (RL) aims to learn an effective policy from a pre-collected dataset. Most existing work focuses on developing sophisticated learning algorithms, with less emphasis on improving the data collection process. Moreover, it is even more challenging to go beyond the single-task setting and collect a task-agnostic dataset that allows an agent to perform multiple downstream tasks. In this paper, we propose a Curiosity-driven Unsupervised Data Collection (CUDC) method that expands the feature space using adaptive temporal distances for task-agnostic data collection, ultimately improving learning efficiency and capabilities for multi-task offline RL. To achieve this, CUDC estimates the probability that the k-step future states are reachable from the current states, and adapts how many steps into the future the dynamics model should predict. With this adaptive reachability mechanism in place, the feature representation can be diversified, and the agent can guide itself with curiosity to collect higher-quality data. Empirically, CUDC surpasses existing unsupervised methods in efficiency and learning performance on various downstream offline RL tasks in the DeepMind control suite.
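As an illustration of the adaptive temporal distance idea, the following minimal sketch adjusts the prediction horizon k from a batch of estimated reachability probabilities. It is not the CUDC implementation: the function name `adapt_temporal_distance`, the thresholds `upper` and `lower`, the bounds `k_min` and `k_max`, and the rule of lengthening the horizon when k-step states already look easily reachable are all assumptions made for this example.

```python
def adapt_temporal_distance(k, reach_probs, k_min=1, k_max=10,
                            upper=0.9, lower=0.5):
    """Return an updated horizon k (hypothetical update rule).

    reach_probs: estimated probabilities that the k-step future states
    are reachable from the current states, one value per sample.
    """
    if not reach_probs:
        # No estimates available: leave the horizon unchanged.
        return k
    mean_reach = sum(reach_probs) / len(reach_probs)
    if mean_reach > upper:
        # Assumption: k-step futures are already easy to reach,
        # so look one step further ahead (capped at k_max).
        return min(k + 1, k_max)
    if mean_reach < lower:
        # Assumption: k-step futures are rarely reachable,
        # so shorten the horizon (floored at k_min).
        return max(k - 1, k_min)
    return k
```

Under these assumptions, calling `adapt_temporal_distance(3, [0.95, 0.97])` lengthens the horizon by one step, while a batch with low average reachability shortens it.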