Efficient Online Reinforcement Learning Fine-Tuning Need Not Retain Offline Data

The modern paradigm in machine learning involves pre-training on diverse data, followed by task-specific fine-tuning. In reinforcement learning (RL), this translates to learning via offline RL on a diverse historical dataset, followed by rapid online RL fine-tuning using interaction data. Most RL fine-tuning methods require continued training on offline data for stability and performance. However, this is undesirable because training on diverse offline data is slow and expensive for large datasets, and, in principle, it also limits the achievable performance improvement because of the constraints or pessimism imposed on offline data. In this paper, we show that retaining offline data is unnecessary as long as we use a properly designed online RL approach for fine-tuning offline RL initializations. To build this approach, we start by analyzing the role of retaining offline data in online fine-tuning. We find that continued training on offline data is mostly useful for preventing a sudden divergence in the value function at the onset of fine-tuning, caused by a distribution mismatch between the offline data and online rollouts. This divergence typically results in unlearning and forgetting the benefits of offline pre-training. Our approach, Warm-start RL (WSRL), mitigates the catastrophic forgetting of pre-trained initializations using a very simple idea. WSRL employs a warmup phase that seeds the online RL run with a very small number of rollouts from the pre-trained policy, which enables fast online RL. The data collected during warmup helps "recalibrate" the offline Q-function to the online distribution, allowing us to completely discard offline data without destabilizing online RL fine-tuning. We show that WSRL is able to fine-tune without retaining any offline data, and that it learns faster and attains higher performance than existing algorithms, irrespective of whether they retain offline data.
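
The warmup idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the underlying online RL algorithm, so tabular Q-learning is swapped in purely for illustration, and `ToyEnv`, `wsrl_sketch`, and every parameter name and default value below are hypothetical. The sketch shows only the structure the abstract describes: the replay buffer starts empty (no offline data is retained), a small number of warmup steps are collected with the frozen pre-trained policy, and only then does online updating begin from the pre-trained Q-values.

```python
import random
from collections import deque


class ToyEnv:
    """Hypothetical 1-D chain environment, used only to make the sketch runnable."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 or +1; reward is 1.0 only when the right end is reached
        self.state = max(0, min(self.length, self.state + action))
        done = self.state == self.length
        return self.state, (1.0 if done else 0.0), done


def wsrl_sketch(env, pretrained_policy, q_table, warmup_steps=20,
                online_steps=200, batch_size=8, lr=0.1, gamma=0.99,
                epsilon=0.1, seed=0):
    """Warmup with the frozen pre-trained policy, then online Q-learning.

    q_table holds the pre-trained (offline) Q-values and is updated in place.
    """
    rng = random.Random(seed)
    actions = (-1, 1)
    buffer = deque(maxlen=10_000)  # starts empty: no offline data is loaded
    state = env.reset()
    for t in range(warmup_steps + online_steps):
        warmup = t < warmup_steps
        if warmup:
            action = pretrained_policy(state)  # frozen policy, no exploration
        elif rng.random() < epsilon:
            action = rng.choice(actions)
        else:
            action = max(actions, key=lambda a: q_table.get((state, a), 0.0))
        next_state, reward, done = env.step(action)
        buffer.append((state, action, reward, next_state, done))
        state = env.reset() if done else next_state
        if warmup:
            continue  # warmup only collects data; no gradient or table updates
        # Online phase: update from the buffer, which holds online data only.
        batch = rng.sample(list(buffer), min(batch_size, len(buffer)))
        for s, a, r, s2, d in batch:
            bootstrap = 0.0 if d else gamma * max(
                q_table.get((s2, b), 0.0) for b in actions)
            old = q_table.get((s, a), 0.0)
            q_table[(s, a)] = old + lr * (r + bootstrap - old)
    return q_table, buffer
```

A possible usage, with a stand-in "pre-trained" policy that always moves right and a Q-table that mildly prefers that action:

```python
q0 = {(s, 1): 0.5 for s in range(5)}
q, buf = wsrl_sketch(ToyEnv(), lambda s: 1, dict(q0))
```

The design choice the sketch tries to highlight is that the warmup transitions come from the same state distribution the online phase will see, so the first updates are made on in-distribution data rather than on a buffer dominated by offline samples.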
