The modern paradigm in machine learning involves pre-training on diverse data, followed by task-specific fine-tuning. In reinforcement learning (RL), this translates to learning via offline RL on a diverse historical dataset, followed by rapid online RL fine-tuning using interaction data. Most RL fine-tuning methods require continued training on offline data for stability and performance. However, this is undesirable: training on diverse offline data is slow and expensive for large datasets, and, in principle, the constraints or pessimism imposed by the offline data also limit the achievable performance improvement. In this paper, we show that retaining offline data is unnecessary as long as we use a properly-designed online RL approach for fine-tuning offline RL initializations. To build this approach, we start by analyzing the role of retaining offline data in online fine-tuning. We find that continued training on offline data is mostly useful for preventing a sudden divergence in the value function at the onset of fine-tuning, caused by a distribution mismatch between the offline data and online rollouts. This divergence typically results in unlearning and forgetting the benefits of offline pre-training. Our approach, Warm-start RL (WSRL), mitigates the catastrophic forgetting of pre-trained initializations using a very simple idea: WSRL employs a warmup phase that seeds the online RL run with a very small number of rollouts from the pre-trained policy before switching to fast online RL. The data collected during warmup helps ``recalibrate'' the offline Q-function to the online distribution, allowing us to completely discard offline data without destabilizing online RL fine-tuning. We show that WSRL is able to fine-tune without retaining any offline data, and that it learns faster and attains higher performance than existing algorithms, irrespective of whether they retain offline data.
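The fine-tuning loop described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the environment interface, the `warmup_rollouts` and `online_steps` counts, the batch size, and the `update_fn` hook standing in for the online RL algorithm's gradient step are all assumptions introduced here.

```python
import random

def collect_rollout(policy, env, horizon=10):
    """Roll out a policy in the environment and return a list of transitions.
    Assumes a toy env interface: reset() -> obs, step(a) -> (obs, reward, done)."""
    transitions = []
    obs = env.reset()
    for _ in range(horizon):
        action = policy(obs)
        next_obs, reward, done = env.step(action)
        transitions.append((obs, action, reward, next_obs, done))
        obs = next_obs
        if done:
            break
    return transitions

def wsrl_finetune(pretrained_policy, update_fn, env,
                  warmup_rollouts=5, online_steps=100, batch_size=32):
    """Warm-start RL sketch: seed an *empty* online buffer with a handful of
    rollouts from the pre-trained policy, then run standard online RL that
    samples only from online data. No offline dataset is retained."""
    buffer = []
    # Warmup phase: a small number of rollouts from the pre-trained policy
    # "recalibrates" the offline Q-function to the online distribution.
    for _ in range(warmup_rollouts):
        buffer.extend(collect_rollout(pretrained_policy, env))
    # Online phase: alternate collection and updates on online data only.
    policy = pretrained_policy
    for _ in range(online_steps):
        buffer.extend(collect_rollout(policy, env))
        batch = random.sample(buffer, min(batch_size, len(buffer)))
        policy = update_fn(policy, batch)  # one update of the online RL algorithm
    return policy
```

The key design point mirrored here is that the warmup data comes from the pre-trained policy itself, so the value function never sees a sudden distribution shift at the onset of fine-tuning, which is what the paper identifies as the cause of divergence and forgetting.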