Recent progress in Large Language Model (LLM) reasoning is increasingly driven by the refinement of post-training loss functions and alignment strategies. However, standard Reinforcement Learning (RL) paradigms such as Group Relative Policy Optimization (GRPO) remain constrained by static uniformity: uniform prompt sampling and a fixed number of rollouts per prompt. For heterogeneous, heavy-tailed reasoning data, this creates structural inefficiencies, wasting compute on already-solved patterns while under-training the long tail of hard problems. To address this, we propose Multi-Adversary Group Distributionally Robust Optimization (GDRO), an optimization-first framework that moves beyond uniform sampling by dynamically adapting the training distribution. We introduce an Online Difficulty Classifier that partitions prompts into dynamic pass@k difficulty groups. We then propose two independent GDRO games for post-training: (1) Prompt-GDRO, which employs an EMA-debiased multiplicative-weights bandit sampler to target the intensive difficulty margin and upweight persistently hard groups without frequency bias; and (2) Rollout-GDRO, which uses a shadow-price controller to reallocate rollouts across groups, maximizing gradient-variance reduction on hard tasks under a fixed mean budget (compute-neutral). We provide no-regret guarantees for both controllers, together with a variance-proxy analysis motivating a square-root-optimal rollout allocation for Rollout-GDRO. We validate our framework on the DAPO 14.1k dataset using Qwen3-Base models. Prompt-GDRO and Rollout-GDRO achieve average relative gains of +10.6% and +10.1%, respectively, in pass@8 accuracy across the 1.7B, 4B, and 8B scales compared to the GRPO baseline. Qualitative analysis reveals an emergent curriculum: the adversaries shift resources toward the evolving reasoning frontier, enhancing the model's reasoning performance.
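The two controllers described above can be illustrated with a minimal sketch, assuming a Hedge-style multiplicative-weights update with importance-weighted EMA debiasing for Prompt-GDRO, and a square-root allocation rule for Rollout-GDRO. The class name, hyperparameters (`eta`, `ema_beta`), and normalization choices below are hypothetical illustrations, not the paper's implementation.

```python
import numpy as np

class PromptGDROSampler:
    """Illustrative sketch: multiplicative-weights sampling over
    difficulty groups, with importance-weighted EMA loss estimates so
    frequently sampled groups are not over-penalized.
    Hyperparameters here are assumptions, not the paper's values."""

    def __init__(self, n_groups, eta=0.1, ema_beta=0.9, seed=0):
        self.w = np.ones(n_groups)       # multiplicative weights
        self.ema = np.zeros(n_groups)    # EMA of debiased per-group loss
        self.eta, self.beta = eta, ema_beta
        self.rng = np.random.default_rng(seed)

    def sample(self):
        # sample a group in proportion to its current weight
        p = self.w / self.w.sum()
        g = self.rng.choice(len(self.w), p=p)
        return g, p[g]

    def update(self, g, loss, p_g):
        # importance-weight to debias for sampling frequency,
        # then smooth with an EMA before the multiplicative update
        est = loss / max(p_g, 1e-8)
        self.ema[g] = self.beta * self.ema[g] + (1 - self.beta) * est
        # upweight groups with persistently high loss (hard groups)
        self.w[g] *= np.exp(self.eta * self.ema[g])
        self.w /= self.w.max()  # keep weights numerically bounded


def sqrt_rollout_allocation(var_proxy, mean_budget):
    """Allocate per-group rollout counts proportional to the square root
    of a per-group variance proxy, holding the mean rollout budget fixed
    (compute-neutral). Illustrative only; rounding scheme is an assumption."""
    s = np.sqrt(np.asarray(var_proxy, dtype=float))
    alloc = mean_budget * len(s) * s / s.sum()
    return np.maximum(1, np.round(alloc)).astype(int)
```

In this sketch, a group whose prompts keep failing accumulates a high EMA loss and receives a growing share of the sampling distribution, while `sqrt_rollout_allocation` shifts rollouts toward high-variance (hard) groups without changing total compute.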