In reinforcement learning (RL), there are two major settings for interacting with the environment: online and offline. Online methods explore the environment at a significant time cost, whereas offline methods obtain reward signals efficiently by sacrificing exploration capability. We propose semi-offline RL, a novel paradigm that transitions smoothly from the offline to the online setting, balances exploration capability against training cost, and provides a theoretical foundation for comparing different RL settings. Based on the semi-offline formulation, we present the RL setting that is optimal in terms of optimization cost, asymptotic error, and overfitting error bound. Extensive experiments show that our semi-offline approach is efficient and achieves performance comparable to, and often better than, state-of-the-art methods.