Multi-task reinforcement learning endeavors to accomplish a set of different tasks with a single policy. To enhance data efficiency by sharing parameters across multiple tasks, a common practice segments the network into distinct modules and trains a routing network to recombine these modules into task-specific policies. However, existing routing approaches employ a fixed number of modules for all tasks, neglecting that tasks of varying difficulty commonly require varying amounts of knowledge. This work presents a Dynamic Depth Routing (D2R) framework, which learns to strategically skip certain intermediate modules, thereby flexibly choosing a different number of modules for each task. Under this framework, we further introduce a ResRouting method to address the issue of disparate routing paths between behavior and target policies during off-policy training. In addition, we design an automatic route-balancing mechanism to encourage continued routing exploration for unmastered tasks without disturbing the routing of mastered ones. We conduct extensive experiments on various robotic manipulation tasks in the Meta-World benchmark, where D2R achieves state-of-the-art performance with significantly improved learning efficiency.
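As a rough illustration of the depth-skipping idea described in the abstract, the following minimal sketch routes an input through a shared stack of modules and lets each task skip some of them. It is not the D2R implementation: the threshold-based `select_modules` rule, the randomly initialised `routing_logits` table (standing in for a learned routing network), the module shapes, and all names are assumptions made only for illustration, and neither ResRouting nor the route-balancing mechanism is modeled.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 8       # width shared by all modules (assumed)
NUM_MODULES = 4  # intermediate modules available to every task (assumed)
NUM_TASKS = 3    # number of tasks (assumed)

# One weight matrix per module; the modules are shared across tasks.
module_weights = [
    rng.normal(scale=0.5, size=(HIDDEN, HIDDEN)) for _ in range(NUM_MODULES)
]

# Hypothetical routing logits, one row per task. In a trained system these
# would come from a routing network; here they are random placeholders.
routing_logits = rng.normal(size=(NUM_TASKS, NUM_MODULES))


def select_modules(task_id, threshold=0.0):
    """Return a boolean mask; True means the module is used for this task."""
    return routing_logits[task_id] > threshold


def forward(x, task_id):
    """Apply only the modules selected for task_id.

    Returns the final hidden vector and the number of modules used.
    """
    mask = select_modules(task_id)
    h = x
    for weight, use in zip(module_weights, mask):
        if use:
            h = np.maximum(h @ weight, 0.0)  # linear layer followed by ReLU
        # A skipped module leaves h unchanged (identity pass-through).
    return h, int(mask.sum())


if __name__ == "__main__":
    x = np.ones(HIDDEN)
    for task_id in range(NUM_TASKS):
        _, used = forward(x, task_id)
        print(f"task {task_id}: {used} of {NUM_MODULES} modules used")
```

In this sketch the effective depth of the policy network varies per task because the mask differs per row of `routing_logits`; a task whose mask is all `False` passes its input through unchanged, while a task whose mask is all `True` uses the full stack.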