Reinforcement learning (RL) has played a central role in recent advances in large reasoning models (LRMs), yielding strong gains in verifiable and open-ended reasoning. However, training a single general-purpose LRM across diverse domains remains challenging due to pronounced domain heterogeneity. Through a systematic study of two widely used strategies, Sequential RL and Mixed RL, we find that both incur substantial cross-domain interference at the behavioral and gradient levels, resulting in limited overall gains. To address these challenges, we introduce **M**odular **G**radient **S**urgery (**MGS**), which resolves gradient conflicts at the module level within the transformer. When applied to Llama and Qwen models, MGS achieves average improvements of 4.3 points (16.6%) and 4.5 points (11.1%), respectively, over standard multi-task RL across three representative domains (math, general chat, and instruction following). Further analysis demonstrates that MGS remains effective under prolonged training. Overall, our study clarifies the sources of interference in multi-domain RL and presents an effective solution for training general-purpose LRMs.
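The abstract does not spell out MGS's projection rule. As a reading aid only, below is a minimal sketch of what module-level gradient surgery could look like, assuming a PCGrad-style conflict projection (Yu et al., 2020) applied independently to each transformer module rather than to the full parameter vector; the actual MGS procedure may differ. All names (`project_conflict`, `modular_gradient_surgery`, `domain_grads`) are hypothetical.

```python
import torch

def project_conflict(g_i: torch.Tensor, g_j: torch.Tensor) -> torch.Tensor:
    """If g_i conflicts with g_j (negative dot product), remove the
    component of g_i along g_j; otherwise return g_i unchanged."""
    dot = torch.dot(g_i, g_j)
    if dot < 0:
        g_i = g_i - dot / (g_j.norm() ** 2 + 1e-12) * g_j
    return g_i

def modular_gradient_surgery(
    domain_grads: dict[str, dict[str, torch.Tensor]],
) -> dict[str, torch.Tensor]:
    """domain_grads maps domain name -> {module name -> flattened gradient}.
    Conflicts are resolved per module (e.g., per attention or MLP block),
    then the projected per-domain gradients are averaged. Projection order
    is fixed here for simplicity; PCGrad randomizes it."""
    domains = list(domain_grads)
    modules = domain_grads[domains[0]].keys()
    merged = {}
    for m in modules:
        projected = []
        for d_i in domains:
            g = domain_grads[d_i][m].clone()
            for d_j in domains:
                if d_j != d_i:
                    g = project_conflict(g, domain_grads[d_j][m])
            projected.append(g)
        merged[m] = torch.stack(projected).mean(dim=0)
    return merged
```

The design point this sketch illustrates is granularity: operating per module lets a gradient pair conflict in, say, one attention block while remaining untouched elsewhere, instead of forcing a single projection decision over all parameters at once.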