Reinforcement learning with verifiable rewards has driven recent advances in LLM post-training, in particular for reasoning. Policy optimization algorithms generate a number of responses for a given prompt and then effectively weight the corresponding gradients according to the rewards. The most popular algorithms, including GRPO, DAPO, and RLOO, focus on ambiguous prompts, i.e., prompts with intermediate success probability, while downweighting gradients from very easy and very hard prompts. In this paper, we consider asymmetric prompt weightings that assign higher weights to prompts with low, or even zero, empirical success probability. We find that asymmetric weighting particularly benefits from-scratch RL (as in R1-Zero), where training traverses a wide accuracy range, and helps less in post-SFT RL, where the model already starts at high accuracy. We also provide theory characterizing the prompt weights that minimize the time needed to raise success probability from an initial level to a target accuracy under a fixed update budget. In low-success regimes, where informative responses are rare and response cost dominates, these optimal weights become asymmetric, upweighting low success probabilities and thereby accelerating effective-time convergence.
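To make the contrast concrete, the following is a minimal sketch of how a prompt's empirical success probability could enter the gradient weighting. The GRPO-style group normalization is standard; the asymmetric weight function `asymmetric_prompt_weight` and its parameter `alpha` are illustrative assumptions, not the specific weighting derived in the paper.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """GRPO-style group-normalized advantages for one prompt's sampled responses.

    Note: if all responses succeed or all fail, the advantages are ~0, so
    very easy and very hard prompts contribute little gradient signal.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def asymmetric_prompt_weight(p_hat, alpha=2.0):
    """Hypothetical asymmetric prompt weight: larger for prompts with low
    empirical success probability p_hat, decaying toward 1 as p_hat grows.
    The functional form and alpha are placeholders for illustration only.
    """
    return 1.0 + alpha * (1.0 - p_hat)

# Example: 8 sampled responses for one prompt with binary verifiable rewards.
rewards = [0, 0, 1, 0, 0, 0, 0, 0]      # empirical success probability = 1/8
p_hat = float(np.mean(rewards))
adv = grpo_advantages(rewards)           # symmetric, group-normalized advantages
w = asymmetric_prompt_weight(p_hat)      # extra weight for this hard prompt
weighted_adv = w * adv                   # per-response weights entering the policy gradient
print(p_hat, w, weighted_adv.round(3))
```

Under this toy weighting, a prompt with low empirical success probability receives a larger multiplier on its (already normalized) advantages, which is the qualitative behavior the asymmetric schemes in the abstract describe; the paper's theory characterizes which weights are actually optimal under a fixed update budget.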