Multi-task reinforcement learning (MTRL) aims to learn several tasks simultaneously for better sample efficiency than learning them separately. Traditional methods achieve this by sharing parameters or relabeled data between tasks. In this work, we introduce a new framework for sharing behavioral policies across tasks, which can be used in addition to existing MTRL methods. The key idea is to improve each task's off-policy data collection by employing behaviors from other task policies. Selectively sharing helpful behaviors acquired in one task to collect training data for another yields higher-quality trajectories and thus more sample-efficient MTRL. To this end, we introduce a simple and principled framework called Q-switch mixture of policies (QMP), which selectively shares behavior between different task policies by using each task's Q-function to evaluate and select useful shareable behaviors. We theoretically analyze how QMP improves the sample efficiency of the underlying RL algorithm. Our experiments show that QMP's behavioral policy sharing provides complementary gains over many popular MTRL algorithms and outperforms alternative behavior-sharing approaches in a range of manipulation, locomotion, and navigation environments. Videos are available at https://qmp-mtrl.github.io.
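The Q-switch selection step can be sketched as follows. This is a minimal illustration under assumed interfaces (each policy is a callable `state -> action`, and `q_function` is a callable `(state, action) -> value`), not the authors' implementation: every task policy proposes an action for the current state, the current task's Q-function scores each proposal, and the highest-scoring action is used to collect that task's training data.

```python
def qmp_select_action(state, policies, q_function):
    """Q-switch behavior selection (illustrative sketch).

    Each task policy proposes a candidate action for `state`; the
    current task's Q-function scores every candidate, and the action
    with the highest estimated value is returned for data collection.
    """
    candidates = [policy(state) for policy in policies]
    # Pick the candidate action the current task's Q-function values most.
    return max(candidates, key=lambda action: q_function(state, action))
```

Because selection is done per state with the task's own Q-function, behaviors from other tasks are only borrowed when they look at least as promising as the task's own policy for that state.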