Two central paradigms have emerged in the reinforcement learning (RL) community: online RL and offline RL. In the online RL setting, the agent has no prior knowledge of the environment and must interact with it in order to find an $\epsilon$-optimal policy. In the offline RL setting, the learner instead has access to a fixed dataset but cannot otherwise interact with the environment, and must obtain the best policy it can from this offline data. Practical scenarios often motivate an intermediate setting: given some offline data and the ability to interact with the environment, how can we best use the offline data to minimize the number of online interactions needed to learn an $\epsilon$-optimal policy? In this work, we consider this setting, which we call the \textsf{FineTuneRL} setting, for MDPs with linear structure. We characterize the number of online samples necessary in this setting given access to some offline dataset, and develop an algorithm, \textsc{FTPedel}, which is provably optimal up to $H$ factors. We show through an explicit example that combining offline data with online interactions can lead to a provable improvement over either purely offline or purely online RL. Finally, our results illustrate the distinction between \emph{verifiable} learning, the typical setting considered in online RL, and \emph{unverifiable} learning, the setting often considered in offline RL, and show that there is a formal separation between these regimes.