Recent advances have demonstrated the effectiveness of Reinforcement Learning (RL) in improving the reasoning capabilities of Large Language Models (LLMs). However, existing works inevitably rely on high-quality instructions and verifiable rewards for effective training, both of which are often difficult to obtain in specialized domains. In this paper, we propose Self-play Reinforcement Learning (SeRL) to bootstrap LLM training with limited initial data. Specifically, SeRL comprises two complementary modules: self-instruction and self-rewarding. The former module generates additional instructions based on the available data at each training step, employing robust online filtering strategies to ensure instruction quality, diversity, and difficulty. The latter module introduces a simple yet effective majority-voting mechanism to estimate response rewards for additional instructions, eliminating the need for external annotations. Finally, SeRL performs conventional RL based on the generated data, facilitating iterative self-play learning. Extensive experiments on various reasoning benchmarks and across different LLM backbones demonstrate that the proposed SeRL yields results superior to its counterparts and achieves performance on par with those obtained by high-quality data with verifiable rewards. Our code is available at https://github.com/wantbook-book/SeRL.
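To make the self-rewarding idea concrete, the following is a minimal sketch of majority-voting reward estimation for a single generated instruction. It is not taken from the released code: the function name `majority_vote_rewards`, the binary 1.0/0.0 reward values, and the assumption that a final answer has already been extracted from each sampled response as a string are all illustrative choices.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Score each sampled answer by whether it agrees with the majority.

    `answers` holds the final answers extracted from several responses
    sampled for one instruction. The 1.0/0.0 reward scheme is an assumption
    made for illustration; the abstract does not specify the reward values.
    """
    if not answers:
        return None, []
    counts = Counter(answers)
    # most_common(1) yields the most frequent answer; on a tie, the answer
    # that appeared first in `answers` is chosen.
    majority_answer, _ = counts.most_common(1)[0]
    rewards = [1.0 if a == majority_answer else 0.0 for a in answers]
    return majority_answer, rewards


# Hypothetical answers extracted from five sampled responses.
sampled = ["42", "42", "41", "42", "7"]
pseudo_label, rewards = majority_vote_rewards(sampled)
print(pseudo_label, rewards)
```

In this sketch the majority answer serves as a pseudo-label in place of an external annotation, and the resulting per-response rewards could then be passed to a conventional RL update. How ties, answer normalization, and low-agreement instructions are handled in practice is left open here.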