Multi-task reinforcement learning (MTRL) demonstrates potential for enhancing the generalization of a robot, enabling it to perform multiple tasks concurrently. However, the performance of MTRL may still be susceptible to conflicts between tasks and negative interference. To facilitate efficient MTRL, we propose Task-Specific Action Correction (TSAC), a general and complementary approach designed for the simultaneous learning of multiple tasks. TSAC decomposes policy learning into two separate policies: a shared policy (SP) and an action correction policy (ACP). To alleviate conflicts arising from the SP's excessive focus on the details of specific tasks, the ACP incorporates goal-oriented sparse rewards, enabling an agent to adopt a long-term perspective and achieve generalization across tasks. These additional rewards transform the original problem into a multi-objective MTRL problem. Furthermore, to convert the multi-objective MTRL problem into a single-objective formulation, TSAC assigns a virtual expected budget to the sparse rewards and employs the Lagrangian method to transform the resulting constrained single-objective optimization into an unconstrained one. Experimental evaluations conducted on Meta-World's MT10 and MT50 benchmarks demonstrate that TSAC outperforms existing state-of-the-art methods, achieving significant improvements in both sample efficiency and effective action execution.
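The budget-constrained formulation and its Lagrangian relaxation can be sketched as follows. This is a minimal illustration of the general technique, not the paper's exact notation: the symbols $J_{\text{task}}$ (dense task return), $J_{\text{sparse}}$ (goal-oriented sparse return), $d$ (virtual expected budget), and $\lambda$ (Lagrange multiplier) are our assumed names.

```latex
% Constrained single-objective problem: maximize task return
% subject to the sparse return meeting the virtual budget d.
\max_{\pi} \; J_{\text{task}}(\pi)
\quad \text{s.t.} \quad J_{\text{sparse}}(\pi) \ge d

% Lagrangian relaxation with multiplier \lambda \ge 0 turns this
% into an unconstrained saddle-point problem:
\min_{\lambda \ge 0} \; \max_{\pi} \;
\mathcal{L}(\pi, \lambda)
= J_{\text{task}}(\pi) + \lambda \bigl( J_{\text{sparse}}(\pi) - d \bigr)
```

In practice, such formulations are typically solved by alternating gradient ascent on the policy parameters and gradient descent on $\lambda$, so that $\lambda$ grows when the budget constraint is violated and shrinks toward zero when it is satisfied.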