This paper studies tabular reinforcement learning (RL) in the hybrid setting, which assumes access to both an offline dataset and online interactions with the unknown environment. A central question boils down to how to efficiently utilize online data collection to strengthen and complement the offline dataset and enable effective policy fine-tuning. Leveraging recent advances in reward-agnostic exploration and model-based offline RL, we design a three-stage hybrid RL algorithm that beats the best of both worlds -- pure offline RL and pure online RL -- in terms of sample complexities. The proposed algorithm does not require any reward information during data collection. Our theory is developed based on a new notion called single-policy partial concentrability, which captures the trade-off between distribution mismatch and miscoverage and guides the interplay between offline and online data.