We consider the hybrid reinforcement learning setting, in which the agent has access to both an offline dataset and online interaction with the environment. While Reinforcement Learning (RL) research typically assumes that offline data contains complete action, reward, and transition information, datasets with only state information (also known as observation-only datasets) are more general, abundant, and practical. This motivates our study of hybrid RL with an observation-only offline dataset. While the task of competing with the best policy "covered" by the offline data can be solved when a reset model of the environment is available (i.e., one that can be reset to any state), we show evidence of hardness when the learner is given only the weaker trace model (i.e., one that can be reset only to initial states and must produce full traces through the environment), absent further assumptions on the admissibility of the offline data. Under an admissibility assumption, namely that the offline data could actually have been produced by the policy class we consider, we propose the first algorithm in the trace model setting that provably matches the performance of algorithms that leverage a reset model. We also perform proof-of-concept experiments that suggest our algorithm is effective in practice.
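To make the two access models concrete, here is a minimal Python sketch of the interfaces each one affords the learner. This is an illustrative assumption, not the paper's API: the names (ResetModel, TraceModel, collect_trace) and the (next_state, reward, done) step signature are hypothetical.

```python
from typing import Any, Callable, List, Tuple

State = Any
Action = Any

class ResetModel:
    """Stronger access: the learner may reset the environment to ANY state,
    e.g., a state observed in the offline dataset (illustrative interface)."""

    def reset_to(self, state: State) -> State:
        """Place the environment in an arbitrary chosen state."""
        raise NotImplementedError

    def step(self, action: Action) -> Tuple[State, float, bool]:
        """Take an action; return (next_state, reward, done)."""
        raise NotImplementedError

class TraceModel:
    """Weaker access: the learner can only reset to an initial state and must
    roll out full traces from there (illustrative interface)."""

    def reset(self) -> State:
        """Reset to a state drawn from the initial-state distribution."""
        raise NotImplementedError

    def step(self, action: Action) -> Tuple[State, float, bool]:
        """Take an action; return (next_state, reward, done)."""
        raise NotImplementedError

def collect_trace(env: TraceModel, policy: Callable[[State], Action],
                  horizon: int) -> List[Tuple[State, Action, float]]:
    """Under the trace model, online data arrives only as full rollouts
    from the initial-state distribution, never from mid-trajectory resets."""
    state = env.reset()
    trace = []
    for _ in range(horizon):
        action = policy(state)
        next_state, reward, done = env.step(action)
        trace.append((state, action, reward))
        if done:
            break
        state = next_state
    return trace
```

The distinction matters because a reset model lets the learner probe any state covered by the offline data directly, whereas under the trace model every such state must first be reached by a rollout from scratch.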