A compelling use case of offline reinforcement learning (RL) is to obtain a policy initialization from existing datasets, followed by fast online fine-tuning with limited interaction. However, existing offline RL methods tend to perform poorly during fine-tuning. In this paper, we devise an approach for learning an effective initialization from offline data that also enables fast online fine-tuning. Our approach, calibrated Q-learning (Cal-QL), accomplishes this by learning, from offline data, a conservative value function initialization that underestimates the value of the learned policy, while also being calibrated, in the sense that the learned Q-values are at a reasonable scale. We refer to this property as calibration, and define it formally as providing a lower bound on the true value function of the learned policy and an upper bound on the value of some other (suboptimal) reference policy, which may simply be the behavior policy. We show that offline RL algorithms that learn such calibrated value functions lead to effective online fine-tuning, enabling us to retain the benefits of offline initialization during online fine-tuning. In practice, Cal-QL can be implemented on top of conservative Q-learning (CQL) for offline RL with a one-line code change. Empirically, Cal-QL outperforms state-of-the-art methods on 9 of the 11 fine-tuning benchmark tasks that we study in this paper. Code and video are available at https://nakamotoo.github.io/Cal-QL
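The abstract does not spell out the one-line change, so the following is only a minimal sketch of how it might look. It assumes a simplified CQL-style penalty (mean Q-value at policy actions minus mean Q-value at dataset actions) and assumes that calibration is enforced by clipping the policy-action Q-values from below at a reference value, such as a return estimate of the behavior policy. The function names `cql_penalty` and `cal_ql_penalty`, the argument `v_ref`, and the toy numbers are all hypothetical.

```python
from statistics import fmean


def cql_penalty(q_policy, q_data):
    """Simplified CQL-style penalty (an assumption, not the full CQL loss).

    Minimizing it pushes Q-values down at policy actions and up at
    dataset actions, which makes the learned Q-function conservative.
    """
    return fmean(q_policy) - fmean(q_data)


def cal_ql_penalty(q_policy, q_data, v_ref):
    """Hypothetical calibrated variant of the penalty above.

    v_ref holds one reference-policy value per state (for example, a
    return estimate for the behavior policy). Clipping at v_ref means
    that a Q-value already below the reference contributes a constant,
    so the penalty stops pushing it further down.
    """
    clipped = [max(q, v) for q, v in zip(q_policy, v_ref)]  # the assumed one-line change
    return fmean(clipped) - fmean(q_data)


# Toy per-state values, chosen only for illustration.
q_policy = [-5.0, 2.0, 0.5]   # Q(s, a) with a sampled from the learned policy
q_data = [1.0, 1.5, 0.0]      # Q(s, a) with a taken from the dataset
v_ref = [0.0, 1.0, 1.0]       # reference-policy value at each state

plain = cql_penalty(q_policy, q_data)
calibrated = cal_ql_penalty(q_policy, q_data, v_ref)
print(plain, calibrated)
```

Under these assumptions the calibrated penalty is never smaller than the plain one, because the clipped values are elementwise at least as large as the originals. In a gradient-based implementation, the clipped entries would receive no gradient from the penalty, which is what keeps the learned Q-values from dropping far below the reference policy's value.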