Reinforcement learning with verifiable rewards (RLVR) is effective for training large language models on reasoning tasks with deterministic, verifiable outcomes. Prior work shows that RLVR can work with few prompts, but prompt selection is often based solely on training-accuracy variance, leading to unstable optimization directions and weaker transfer. We revisit prompt selection from a mechanism-level view and argue that an effective minibatch should provide both (i) a reliable positive anchor and (ii) explicit negative learning signals from rare failures. Based on this principle, we propose \emph{positive--negative pairing}: at each update, we sample a hard-but-solvable prompt $q^{+}$ and an easy-but-brittle prompt $q^{-}$ (high success rate but not perfect), characterized by low and high empirical success rates, respectively, under multiple rollouts. We further introduce Weighted GRPO, which reweights binary outcomes at the pair level and uses group-normalized advantages to amplify rare successes on $q^{+}$ into sharp positive guidance while turning rare failures on $q^{-}$ into strong negative penalties. This bidirectional signal provides informative learning feedback for both successes and failures, improving sample efficiency without suppressing exploration. On Qwen2.5-Math-7B, a single paired minibatch per update consistently outperforms a GRPO baseline that selects two prompts via commonly used variance-based selection heuristics: AIME~2025 Pass@8 improves from 16.8 to 22.2, and AMC23 Pass@64 from 94.0 to 97.0, while remaining competitive with large-scale RLVR trained on a pool of 1209 training prompts. Similar gains are observed on Qwen2.5-Math-7B-Instruct.
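The pairing and reweighting mechanism described above can be illustrated with a minimal sketch. This is not the paper's implementation: the weight values `w_pos`/`w_neg` and the rule of amplifying only the rare outcome in each group are illustrative assumptions; the sketch only shows how group-normalized advantages (as in GRPO) combine with pair-level reweighting so that a rare success on $q^{+}$ yields a large positive advantage and a rare failure on $q^{-}$ yields a large negative one.

```python
def group_normalized_advantages(rewards, eps=1e-8):
    """Standard GRPO-style normalization: subtract the group mean reward
    and divide by the group standard deviation (eps avoids division by zero)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


def weighted_pair_advantages(rewards_pos, rewards_neg, w_pos=2.0, w_neg=2.0):
    """Illustrative pair-level reweighting (weights are assumptions, not the
    paper's values). rewards_* are binary outcomes (1 = success, 0 = failure)
    from rollouts on the paired prompts q+ (hard) and q- (easy but brittle)."""
    # On q+, successes are rare; amplify their positive normalized advantage.
    adv_pos = [
        w_pos * a if r == 1 else a
        for r, a in zip(rewards_pos, group_normalized_advantages(rewards_pos))
    ]
    # On q-, failures are rare; amplify their negative normalized advantage.
    adv_neg = [
        w_neg * a if r == 0 else a
        for r, a in zip(rewards_neg, group_normalized_advantages(rewards_neg))
    ]
    return adv_pos, adv_neg
```

For example, with rollout outcomes `[1, 0, 0, 0]` on $q^{+}$ and `[1, 1, 1, 0]` on $q^{-}$, the lone success on $q^{+}$ receives an amplified positive advantage and the lone failure on $q^{-}$ an amplified negative one, giving the bidirectional signal the abstract describes.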