Multi-task reinforcement learning aims to solve a set of different tasks with a single policy. To improve data efficiency by sharing parameters across tasks, a common practice is to segment the network into distinct modules and train a routing network that recombines these modules into task-specific policies. However, existing routing approaches use a fixed number of modules for all tasks, neglecting that tasks of differing difficulty typically require different amounts of knowledge. This work presents a Dynamic Depth Routing (D2R) framework, which learns to strategically skip certain intermediate modules, thereby flexibly choosing a different number of modules for each task. Under this framework, we further introduce a ResRouting method to address the issue of disparate routing paths between behavior and target policies during off-policy training. In addition, we design an automatic route-balancing mechanism to encourage continued routing exploration for unmastered tasks without disturbing the routing of mastered ones. We conduct extensive experiments on various robotic manipulation tasks in the Meta-World benchmark, where D2R achieves state-of-the-art performance with significantly improved learning efficiency.
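The idea of skipping intermediate modules so that each task uses a different number of them can be illustrated with a minimal sketch. This is not the authors' implementation: the class `DepthRoutedPolicy`, the arguments `use_module` and `route_logits`, and the `tanh` modules are all hypothetical. In D2R the skip decisions and routing weights would be produced by a learned routing network; here they are simply passed in by hand, and ResRouting and route balancing are not covered.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


class DepthRoutedPolicy:
    """Toy stack of modules in which any module can be skipped (hypothetical)."""

    def __init__(self, num_modules=4, dim=8, seed=0):
        rng = np.random.default_rng(seed)
        # One square weight matrix per module; sizes are illustrative assumptions.
        self.mods = [rng.normal(0.0, 0.3, (dim, dim)) for _ in range(num_modules)]

    def forward(self, x, use_module, route_logits):
        """
        x            : input features, shape (dim,)
        use_module   : one boolean per module, False means the module is skipped
        route_logits : shape (num_modules, num_modules + 1); row i scores the
                       input (column 0) and the outputs of modules 0..i-1
                       (columns 1..i) as sources for module i
        Returns the output of the last executed module and the number of
        modules that were executed.
        """
        outputs = [x]   # index 0 holds the raw input
        active = [0]    # indices into `outputs` that were actually computed
        for i, weight in enumerate(self.mods):
            if not use_module[i]:
                outputs.append(None)  # skipped: contributes nothing downstream
                continue
            # Mix only the sources that were executed, weighted by routing probabilities.
            probs = softmax(route_logits[i, active])
            mixed = sum(p * outputs[j] for p, j in zip(probs, active))
            outputs.append(np.tanh(weight @ mixed))
            active.append(i + 1)
        return outputs[active[-1]], len(active) - 1


policy = DepthRoutedPolicy(num_modules=4, dim=8)
features = np.ones(8)
uniform_logits = np.zeros((4, 5))

# An "easy" task might route through two modules, a "hard" one through all four.
_, easy_depth = policy.forward(features, [True, False, False, True], uniform_logits)
_, hard_depth = policy.forward(features, [True, True, True, True], uniform_logits)
print(easy_depth, hard_depth)
```

Passing the skip mask and logits explicitly keeps the sketch small; in a trainable version both would be outputs of a router conditioned on the task, which is the part the abstract attributes to D2R.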