A Natural Extension To Online Algorithms For Hybrid RL With Limited Coverage

Hybrid Reinforcement Learning (RL), which leverages both online and offline data, has garnered recent interest, yet research on its provable benefits remains sparse. Moreover, many existing hybrid RL algorithms (Song et al., 2023; Nakamoto et al., 2023; Amortila et al., 2024) impose coverage assumptions on the offline dataset; we show that such assumptions are unnecessary. A well-designed online algorithm should "fill in the gaps" in the offline dataset, exploring states and actions that the behavior policy did not explore. Unlike previous approaches that focus on estimating the offline data distribution to guide online exploration (Li et al., 2023b), we show that a natural extension to standard optimistic online algorithms -- warm-starting them by including the offline dataset in the experience replay buffer -- achieves similar provable gains from hybrid data even when the offline dataset does not have single-policy concentrability. We accomplish this by partitioning the state-action space into two parts, bounding the regret on the two parts through an offline and an online complexity measure, and showing that the regret of this hybrid RL algorithm can be characterized by the best partition, even though the algorithm does not know the partition itself. As an example, we propose DISC-GOLF, a modification of GOLF, an existing optimistic online algorithm with general function approximation (Jin et al., 2021; Xie et al., 2022a), and show that it achieves provable gains over both online-only and offline-only reinforcement learning, with competitive bounds when specialized to the tabular, linear and block MDP cases. Numerical simulations further validate our theory that hybrid data facilitates more efficient exploration, supporting the potential of hybrid RL in various scenarios.
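
To make the warm-starting idea concrete, the following is a minimal tabular sketch, not DISC-GOLF itself. GOLF maintains confidence sets over a general function class; here a simple count-based optimism bonus is swapped in as a stand-in, and every name (`WarmStartedBuffer`, `bonus`, the toy transitions) is a hypothetical chosen for illustration. The sketch only shows the mechanism the abstract describes: once offline data is placed in the replay buffer, optimism is already small on state-action pairs the behavior policy covered, so exploration is pushed toward the uncovered ones.

```python
import math


class WarmStartedBuffer:
    """Hypothetical replay buffer that is pre-filled with offline transitions.

    Each transition is a (state, action, reward, next_state) tuple.
    """

    def __init__(self, offline_transitions=()):
        self.transitions = []
        self.counts = {}  # (state, action) -> number of stored samples
        for transition in offline_transitions:
            self.add(transition)
        self.n_offline = len(self.transitions)

    def add(self, transition):
        state, action, _reward, _next_state = transition
        self.transitions.append(transition)
        key = (state, action)
        self.counts[key] = self.counts.get(key, 0) + 1

    def bonus(self, state, action, scale=1.0):
        # Count-based optimism bonus (an assumed stand-in for GOLF's
        # confidence sets): large where the buffer holds little data.
        n = max(self.counts.get((state, action), 0), 1)
        return scale / math.sqrt(n)


# Toy offline dataset: the behavior policy only ever took action 0 in state 0.
offline = [(0, 0, 1.0, 0)] * 100
buffer = WarmStartedBuffer(offline)

covered_bonus = buffer.bonus(0, 0)    # shrunk by the 100 offline samples
uncovered_bonus = buffer.bonus(0, 1)  # no offline coverage, bonus stays large

# Online interaction then "fills in the gap" for the uncovered pair.
for _ in range(4):
    buffer.add((0, 1, 0.0, 0))
after_online_bonus = buffer.bonus(0, 1)
```

In this toy setting the uncovered pair `(0, 1)` starts with a bonus ten times larger than the covered pair `(0, 0)`, and the bonus falls as online samples arrive. This loosely mirrors the partition in the analysis, where one part of the state-action space is handled by offline data and the other by online exploration, without the algorithm being told which is which.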

