Leveraging offline data is a promising way to improve the sample efficiency of online reinforcement learning (RL). This paper expands the pool of usable data for offline-to-online RL by drawing on abundant non-curated data that is reward-free, of mixed quality, and collected across multiple embodiments. Although learning a world model appears promising for utilizing such data, we find that naive fine-tuning fails to accelerate RL training on many tasks. Through careful investigation, we attribute this failure to the distributional shift between offline and online data during fine-tuning. To address this issue and use the offline data effectively, we propose two techniques: \emph{i)} experience rehearsal and \emph{ii)} execution guidance. With these modifications, the non-curated offline data substantially improves RL's sample efficiency. Under limited sample budgets, our method achieves nearly twice the aggregate score of learning-from-scratch baselines across 72 visuomotor tasks spanning 6 embodiments. On challenging tasks such as locomotion and robotic manipulation, it outperforms prior methods that utilize offline data by a clear margin.