Large-scale verifiable prompts underpin the success of Reinforcement Learning with Verifiable Rewards (RLVR), but they contain many uninformative examples and are costly to expand further. Recent studies focus on better exploiting limited training data by prioritizing hard prompts whose rollout pass rate is 0. However, easy prompts with a pass rate of 1 also become increasingly prevalent as training progresses, thereby reducing the effective data size. To mitigate this, we propose Composition-RL, a simple yet effective approach for better utilizing limited verifiable prompts by targeting pass-rate-1 prompts. More specifically, Composition-RL automatically composes multiple problems into a new verifiable question and uses these compositional prompts for RL training. Extensive experiments across model sizes from 4B to 30B show that Composition-RL consistently improves reasoning capability over RL trained on the original dataset. Performance can be further boosted with a curriculum variant of Composition-RL that gradually increases compositional depth over training. Additionally, Composition-RL enables more effective cross-domain RL by composing prompts drawn from different domains. Code, datasets, and models are available at https://github.com/XinXU-USTC/Composition-RL.
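The core idea of composing multiple problems into one verifiable question can be illustrated with a minimal sketch. The function and format below are illustrative assumptions, not the actual Composition-RL implementation: several (question, answer) pairs are merged into one compositional prompt whose binary reward is 1 only when every sub-answer is correct.

```python
# Hypothetical sketch of prompt composition; names and the answer
# format are illustrative assumptions, not the Composition-RL codebase.

def compose(problems):
    """Merge several (question, answer) pairs into one compositional
    prompt whose reward requires all sub-answers to be correct."""
    questions = [f"({i}) {q}" for i, (q, _) in enumerate(problems, start=1)]
    prompt = ("Solve all of the following problems and report the answers "
              "in order, separated by semicolons:\n" + "\n".join(questions))
    answer = ";".join(a for _, a in problems)
    return prompt, answer

def verify(response, answer):
    """Binary verifiable reward: 1.0 only if every sub-answer matches."""
    preds = [p.strip() for p in response.split(";")]
    golds = [g.strip() for g in answer.split(";")]
    return float(preds == golds)
```

A pass-rate-1 prompt composed with another in this way becomes harder than either part alone, since the rollout must solve all sub-problems to receive any reward; a curriculum variant would grow the number of composed problems (the compositional depth) as training progresses.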