Sample efficiency and exploration remain major challenges in online reinforcement learning (RL). A powerful way to address both is to incorporate offline data, such as prior trajectories from a human expert or a sub-optimal exploration policy. Previous methods have relied on extensive modifications and additional complexity to make effective use of this data. Instead, we ask: can we simply apply existing off-policy methods to leverage offline data when learning online? In this work, we demonstrate that the answer is yes; however, a set of minimal but important changes to existing off-policy RL algorithms is required to achieve reliable performance. We extensively ablate these design choices, identifying the factors that most affect performance, and arrive at a set of recommendations that practitioners can readily apply, whether their data comprise a small number of expert demonstrations or large volumes of sub-optimal trajectories. We find that correctly applying these simple recommendations can provide a $\mathbf{2.5\times}$ improvement over existing approaches across a diverse set of competitive benchmarks, with no additional computational overhead.
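To make the idea of including offline data in an off-policy learner concrete, the following is a minimal sketch of one way it might be done: each minibatch is drawn partly from a fixed offline dataset and partly from the online replay buffer. The function name `sample_mixed_batch`, the `offline_fraction` parameter, and its default of 0.5 are assumptions made for illustration; the abstract itself does not specify a mixing scheme or ratio.

```python
import random


def sample_mixed_batch(offline_data, online_buffer, batch_size,
                       offline_fraction=0.5, rng=None):
    """Draw a minibatch mixing offline and online transitions.

    Hypothetical helper: `offline_fraction` is an illustrative knob,
    not a value taken from the abstract.
    """
    rng = rng or random.Random()
    if not offline_data and not online_buffer:
        raise ValueError("both data sources are empty")

    if not online_buffer:
        # Before any online experience exists, use offline data only.
        n_offline = batch_size
    elif not offline_data:
        # With no offline data, this reduces to ordinary replay sampling.
        n_offline = 0
    else:
        n_offline = int(batch_size * offline_fraction)
    n_online = batch_size - n_offline

    # Sample with replacement from each source, then shuffle together.
    batch = [rng.choice(offline_data) for _ in range(n_offline)]
    batch += [rng.choice(online_buffer) for _ in range(n_online)]
    rng.shuffle(batch)
    return batch
```

The resulting batch could then be passed to the update step of any off-policy algorithm unchanged, which is the sense in which such a scheme adds no new algorithmic machinery.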