This paper considers an online reinforcement learning algorithm that leverages pre-collected data (passive memory) from the environment during online interaction. We show that using passive memory improves performance and provide theoretical guarantees on the regret, which turns out to be near-minimax optimal. The results show that the quality of the passive memory determines the sub-optimality of the incurred regret. The proposed approach and results hold in both continuous and discrete state-action spaces.