RL-based post-training with GRPO is widely used to improve large language models on individual reasoning tasks. However, real-world deployment requires reliable performance across diverse tasks. A straightforward multi-task adaptation of GRPO often leads to imbalanced outcomes, with some tasks dominating optimization while others stagnate. Moreover, tasks can vary widely in how frequently their prompts yield zero advantages (and thus zero gradients), which further distorts their effective contribution to the optimization signal. To address these issues, we propose a novel Multi-Task GRPO (MT-GRPO) algorithm that (i) dynamically adapts task weights to explicitly optimize worst-task performance and promote balanced progress across tasks, and (ii) introduces a ratio-preserving sampler to ensure task-wise policy gradients reflect the adapted weights. Experiments in both 3-task and 9-task settings show that MT-GRPO consistently outperforms baselines in worst-task accuracy. In particular, MT-GRPO achieves absolute improvements of 16-28% and 6% in worst-task performance over standard GRPO and DAPO, respectively, while maintaining competitive average accuracy. Moreover, MT-GRPO requires 50% fewer training steps to reach 50% worst-task accuracy in the 3-task setting, demonstrating substantially improved efficiency in achieving reliable performance across tasks.
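To make the two mechanisms concrete, the following is a minimal, hypothetical sketch in Python. The exponentiated weight update and the sampling correction are illustrative assumptions; the abstract does not specify MT-GRPO's actual update rule, and all function names and hyperparameters here are invented for exposition.

```python
import numpy as np

# Hypothetical sketch of the two ideas described in the abstract.
# The specific update rules and names below are assumptions, not
# the paper's actual algorithm.

def update_task_weights(weights, task_accuracies, lr=1.0):
    """Exponentiated-gradient-style update that shifts weight toward
    the currently worst-performing tasks, one common way to target
    worst-task performance."""
    # Lower accuracy -> larger multiplicative boost.
    boosted = weights * np.exp(-lr * np.asarray(task_accuracies))
    return boosted / boosted.sum()

def ratio_preserving_batch(weights, nonzero_adv_rates, batch_size):
    """Draw per-task prompt counts so the *effective* gradient
    contribution (prompts surviving with nonzero advantage) matches
    the adapted weights, compensating tasks whose prompts frequently
    yield zero advantages."""
    raw = weights / np.maximum(nonzero_adv_rates, 1e-6)
    probs = raw / raw.sum()
    return np.random.multinomial(batch_size, probs)

# Toy usage: three tasks; task 0 lags and also wastes many prompts.
w = np.ones(3) / 3
acc = [0.2, 0.6, 0.8]            # per-task accuracy estimates
nz = np.array([0.3, 0.7, 0.9])   # fraction of prompts with nonzero advantage
w = update_task_weights(w, acc)
counts = ratio_preserving_batch(w, nz, batch_size=256)
print(w, counts)  # task 0 receives the most weight and the most prompts
```

Under these assumptions, the weight update alone would over- or under-represent a task whose prompts often produce zero advantages; dividing by the nonzero-advantage rate before sampling restores the intended ratio of usable gradient signal per task.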