In reinforcement learning (RL), the exploration strategy and the replay buffer are key components of many algorithms. Together they determine which environment data is collected and trained on, and both have been studied extensively in the RL literature. In this paper, we investigate the impact of these components on generalisation in multi-task RL. We test the hypothesis that collecting and training on more diverse data from the training environments improves zero-shot generalisation to new tasks. We motivate mathematically and show empirically that increasing the diversity of transitions in the replay buffer improves generalisation to tasks that are "reachable" during training. Furthermore, we show empirically that the same strategy also improves generalisation to similar but "unreachable" tasks, which could be due to improved generalisation of the learned latent representations.