Using Curiosity for an Even Representation of Tasks in Continual Offline Reinforcement Learning

In this work, we investigate how curiosity can be applied to replay buffers to improve offline multi-task continual reinforcement learning when tasks, which are defined by the non-stationarity of the environment, are unlabeled and unevenly exposed to the learner over time. In particular, we investigate curiosity both as a tool for task boundary detection and as a priority metric for retaining old transition tuples, and we propose one buffer for each use. First, we propose a Hybrid Reservoir Buffer with Task Separation (HRBTS), in which curiosity is used to detect task boundaries that are unknown because of the task-agnostic nature of the problem. Second, we propose a Hybrid Curious Buffer (HCB), in which curiosity serves as the priority metric for retaining old transition tuples. We show that these buffers, combined with standard reinforcement learning algorithms, alleviate the catastrophic forgetting suffered by state-of-the-art replay buffers when the agent's exposure to tasks is uneven over time. We evaluate catastrophic forgetting and the efficiency of the proposed buffers against recent work, such as the Hybrid Reservoir Buffer (HRB) and the Multi-Time Scale Replay Buffer (MTR), in three different continual reinforcement learning settings. Experiments were conducted on classical control tasks and the Metaworld environment. The results show that the proposed replay buffers are more robust to catastrophic forgetting than existing work in most of the settings.
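
The two roles of curiosity named above can be illustrated with a minimal Python sketch. It is not the HRBTS or HCB implementation, whose details are not given here; it only shows one plausible reading of each idea. The class names `CuriosityPriorityBuffer` and `BoundaryDetector`, the `threshold` and `warmup` parameters, and the assumption that a scalar curiosity score (for example, a forward-model prediction error) is supplied by the caller are all hypothetical.

```python
import heapq
import itertools
import math


class CuriosityPriorityBuffer:
    """Fixed-size buffer that retains the transitions with the highest curiosity.

    Illustrative only: the eviction rule (drop the least curious item) is an
    assumption, not the rule used by HCB.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []                # min-heap of (curiosity, id, transition)
        self._ids = itertools.count()  # tie-breaker so transitions are never compared

    def add(self, transition, curiosity):
        entry = (curiosity, next(self._ids), transition)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif curiosity > self._heap[0][0]:
            # Buffer is full: replace the least curious stored transition.
            heapq.heapreplace(self._heap, entry)

    def transitions(self):
        return [t for _, _, t in self._heap]


class BoundaryDetector:
    """Flags a task boundary when curiosity spikes above its running statistics.

    Illustrative only: the mean + threshold * std test is an assumption.
    """

    def __init__(self, threshold=3.0, warmup=10):
        self.threshold = threshold
        self.warmup = warmup
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, curiosity):
        std = math.sqrt(self.m2 / self.n) if self.n > 0 else 0.0
        is_boundary = (
            self.n >= self.warmup
            and curiosity > self.mean + self.threshold * std
        )
        if is_boundary:
            # Start fresh statistics for the newly detected task.
            self.n, self.mean, self.m2 = 0, 0.0, 0.0
        # Welford's online update of the running mean and variance.
        self.n += 1
        delta = curiosity - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (curiosity - self.mean)
        return is_boundary
```

In this sketch, a learner would call `BoundaryDetector.update` on each incoming transition's curiosity score to decide when to open a new task partition, and `CuriosityPriorityBuffer.add` to decide which old transitions survive once capacity is reached. A hybrid design would presumably combine such a store with another retention scheme, but that combination is not sketched here.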
