Continual reinforcement learning poses a major challenge because agents tend to suffer catastrophic forgetting when learning tasks sequentially. In this paper, we introduce a modularity-based approach, called Hierarchical Orchestra of Policies (HOP), designed to mitigate catastrophic forgetting in lifelong reinforcement learning. HOP dynamically forms a hierarchy of policies based on a similarity metric between the current observations and observations previously encountered in successfully completed tasks. Unlike other state-of-the-art methods, HOP does not require task labelling, allowing for robust adaptation in environments where the boundaries between tasks are ambiguous. Our experiments, conducted across multiple tasks in a procedurally generated suite of environments, demonstrate that HOP significantly outperforms baseline methods at retaining knowledge across tasks and performs comparably to state-of-the-art transfer methods that require task labelling. Moreover, HOP achieves this without compromising performance when tasks remain constant, highlighting its versatility.
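The abstract only names the mechanism, so the following is a minimal sketch of how similarity-gated policy selection could operate, not the paper's implementation. Everything in it is an illustrative assumption: the `PolicyModule` class, the linear placeholder policy, cosine similarity as the metric, the 0.8 activation threshold, and the similarity-weighted mixture of module outputs.

```python
import numpy as np

class PolicyModule:
    """A policy plus a memory of observations from episodes where it succeeded.
    (Hypothetical structure; HOP's actual modules and metric may differ.)"""
    def __init__(self, obs_dim: int, n_actions: int, rng: np.random.Generator):
        # Placeholder linear policy; the orchestration logic below is agnostic
        # to the policy class.
        self.weights = rng.normal(scale=0.1, size=(obs_dim, n_actions))
        self.memory: list[np.ndarray] = []  # observations from successful tasks

    def remember(self, obs: np.ndarray) -> None:
        self.memory.append(obs)

    def act_logits(self, obs: np.ndarray) -> np.ndarray:
        return obs @ self.weights

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def orchestrate(modules: list[PolicyModule], obs: np.ndarray,
                threshold: float = 0.8) -> np.ndarray:
    """Weight each module by its best similarity to the current observation;
    modules below the threshold are excluded from the active hierarchy."""
    logits, sims = [], []
    for m in modules:
        if not m.memory:
            continue
        sim = max(cosine_similarity(obs, o) for o in m.memory)
        if sim >= threshold:
            logits.append(m.act_logits(obs))
            sims.append(sim)
    if not sims:  # no stored observation matches: fall back to the newest module
        return modules[-1].act_logits(obs)
    w = np.array(sims) / np.sum(sims)
    return np.tensordot(w, np.stack(logits), axes=1)

# Usage: two modules, each remembering one observation; only the module whose
# memory resembles the query clears the threshold and contributes to the output.
rng = np.random.default_rng(0)
mods = [PolicyModule(4, 3, rng) for _ in range(2)]
mods[0].remember(np.array([1.0, 0.0, 0.0, 0.0]))
mods[1].remember(np.array([0.0, 1.0, 0.0, 0.0]))
print(orchestrate(mods, np.array([0.9, 0.1, 0.0, 0.0])))
```

Because activation depends only on observation similarity, nothing in this sketch consumes a task label, which is the property the abstract emphasizes for settings with ambiguous task boundaries.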