Learning to Modulate pre-trained Models in RL

Reinforcement Learning (RL) has been successful in domains such as robotics, game playing, and simulation. Although RL agents show impressive capabilities on the specific tasks they were trained for, they adapt poorly to new tasks. In supervised learning, this adaptation problem is addressed by large-scale pre-training followed by fine-tuning on new downstream tasks. Recently, pre-training on multiple tasks has been gaining traction in RL. However, fine-tuning a pre-trained model often suffers from catastrophic forgetting: performance on the pre-training tasks deteriorates when the model is fine-tuned on new tasks. To investigate this phenomenon, we first jointly pre-train a model on datasets from two benchmark suites, Meta-World and DMControl. We then evaluate and compare a variety of fine-tuning methods prevalent in natural language processing, both in terms of performance on new tasks and in terms of how well performance on the pre-training tasks is retained. Our study shows that with most fine-tuning approaches, performance on the pre-training tasks deteriorates significantly. We therefore propose a novel method, Learning-to-Modulate (L2M), which avoids the degradation of learned skills by modulating the information flow of the frozen pre-trained model via a learnable modulation pool. Our method achieves state-of-the-art performance on the Continual-World benchmark while retaining performance on the pre-training tasks. Finally, to aid future research in this area, we release a dataset encompassing 50 Meta-World and 16 DMControl tasks.
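
To make the idea of modulating a frozen model through a learnable pool more concrete, the following is a minimal NumPy sketch. It is not the paper's implementation: the abstract only states that the pre-trained model stays frozen and that a learnable modulation pool alters its information flow. The class `ModulationPool`, the key-based selection by cosine similarity, and the scale-and-shift form of the modulation are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


class ModulationPool:
    """Hypothetical pool of (key, modulator) pairs.

    Assumption: only the pool would be trained on new tasks,
    while the pre-trained weights are never updated.
    """

    def __init__(self, pool_size, query_dim, hidden_dim):
        self.keys = rng.normal(size=(pool_size, query_dim))
        # Modulators start at zero, so the frozen layer is initially unchanged.
        self.scale = np.zeros((pool_size, hidden_dim))
        self.shift = np.zeros((pool_size, hidden_dim))

    def select(self, query):
        # Pick the pool entry whose key is most similar to the query (cosine).
        keys = self.keys / np.linalg.norm(self.keys, axis=1, keepdims=True)
        q = query / np.linalg.norm(query)
        return int(np.argmax(keys @ q))


def modulated_forward(x, frozen_w, pool, query):
    """Apply a frozen linear layer, then the selected modulator."""
    idx = pool.select(query)
    h = x @ frozen_w  # frozen pre-trained computation
    return h * (1.0 + pool.scale[idx]) + pool.shift[idx], idx


# Illustrative usage with made-up shapes.
frozen_w = rng.normal(size=(4, 3))
frozen_copy = frozen_w.copy()
pool = ModulationPool(pool_size=2, query_dim=4, hidden_dim=3)
x = rng.normal(size=(4,))

out_before, idx = modulated_forward(x, frozen_w, pool, query=x)
pool.shift[idx] += 0.5  # stand-in for a gradient step on a new task
out_after, _ = modulated_forward(x, frozen_w, pool, query=x)
```

In this sketch, adapting to a new task changes only the selected pool entry. Because `frozen_w` is never written to, setting the modulators back to zero recovers the original pre-trained behavior exactly, which is the intuition behind avoiding forgetting with a frozen backbone.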
