Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning

A compelling use case of offline reinforcement learning (RL) is to obtain a policy initialization from existing datasets, which allows efficient fine-tuning with limited amounts of active online interaction. However, several existing offline RL methods tend to exhibit poor online fine-tuning performance. On the other hand, online RL methods can learn effectively through online interaction, but struggle to incorporate offline data, which can make them very slow in settings where exploration is challenging or pre-training is necessary. In this paper, we devise an approach for learning an effective initialization from offline data that also enables fast online fine-tuning. Our approach, calibrated Q-learning (Cal-QL), accomplishes this by learning a conservative value function initialization that underestimates the value of the learned policy from offline data, while also being calibrated, in the sense that the learned Q-values are at a reasonable scale. We refer to this property as calibration, and define it formally as providing a lower bound on the true value function of the learned policy and an upper bound on the value of some other (suboptimal) reference policy, which may simply be the behavior policy. We show that offline RL algorithms that learn such calibrated value functions lead to effective online fine-tuning, so that the benefits of the offline initialization carry over to online fine-tuning. In practice, Cal-QL can be implemented on top of existing conservative methods for offline RL with a one-line code change. Empirically, Cal-QL outperforms state-of-the-art methods on 10 of the 11 fine-tuning benchmark tasks that we study in this paper.
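As a minimal sketch of what such a one-line change could look like, assume a simplified CQL-style regularizer that pushes Q-values down on policy actions and up on dataset actions. The calibrated variant clips the policy-action Q-values from below at an estimate of the reference policy's value (for example, a Monte Carlo return-to-go computed from the dataset) before they enter the penalty. The function names, argument names, and the mean-based penalty are illustrative assumptions, not the authors' implementation; practical conservative methods typically use a log-sum-exp over sampled actions instead of a plain mean.

```python
import numpy as np


def conservative_penalty(q_policy, q_data, v_ref=None):
    """Simplified CQL-style regularizer (hypothetical helper).

    q_policy: Q(s, a') for actions a' sampled from the learned policy, shape (batch,)
    q_data:   Q(s, a) for the actions stored in the offline dataset, shape (batch,)
    v_ref:    optional estimate of the reference policy's value V^mu(s), shape (batch,)
    """
    q_policy = np.asarray(q_policy, dtype=float)
    q_data = np.asarray(q_data, dtype=float)
    if v_ref is not None:
        # Assumed form of the one-line change: Q-values already below the
        # reference value are replaced by it, so the penalty no longer
        # pushes them further down.
        q_policy = np.maximum(q_policy, np.asarray(v_ref, dtype=float))
    return float(np.mean(q_policy) - np.mean(q_data))


def is_calibrated(v_learned, v_true_policy, v_ref):
    """Check the two-sided bound from the definition of calibration:
    V^mu(s) <= learned value <= true value of the learned policy, for every state."""
    v_learned = np.asarray(v_learned, dtype=float)
    lower_ok = np.all(np.asarray(v_ref, dtype=float) <= v_learned)
    upper_ok = np.all(v_learned <= np.asarray(v_true_policy, dtype=float))
    return bool(lower_ok and upper_ok)
```

Calling `conservative_penalty(q_policy, q_data)` gives the uncalibrated penalty, and passing `v_ref` switches on the clipping. `is_calibrated` only makes sense in a diagnostic setting where the true policy value is known or can be estimated by rollouts.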
