In some applications of reinforcement learning, a dataset of pre-collected experience is available, but it is also possible to acquire additional online data to improve the quality of the policy. However, it may be preferable to gather this additional data with a single, non-reactive exploration policy and thereby avoid the engineering costs associated with switching policies. In this paper, we propose an algorithm with provable guarantees that leverages an offline dataset to design a single non-reactive policy for exploration. We analyze the algorithm theoretically and measure the quality of the final policy as a function of the local coverage of the original dataset and the amount of additional data collected.