Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising approach for training reasoning language models (RLMs) by leveraging supervision from verifiers. Although implementing a verifier is easier than annotating solutions for many tasks, existing synthetic data generation methods remain largely solution-centric, while verifier-based methods rely on a small number of hand-crafted procedural environments. In this work, we scale RLVR by introducing ReSyn, a pipeline that generates diverse reasoning environments, each equipped with an instance generator and a verifier, covering tasks such as constraint satisfaction, algorithmic puzzles, and spatial reasoning. A Qwen2.5-7B-Instruct model trained with RL on ReSyn data achieves consistent gains across reasoning benchmarks and out-of-domain math benchmarks, including a 27\% relative improvement on the challenging BBEH benchmark. Ablations show that both verifier-based supervision and increased task diversity contribute significantly, providing empirical evidence that generating reasoning environments at scale can enhance the reasoning abilities of RLMs.
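To make the generator and verifier pairing concrete, the following is a minimal sketch of what one such reasoning environment could look like. The class name, method signatures, and the toy sorting task are all illustrative assumptions, not the paper's actual interface; the key property it demonstrates is that instances can be sampled programmatically and responses scored with a binary verifiable reward.

```python
import random

class SortingPuzzleEnv:
    """Hypothetical ReSyn-style reasoning environment pairing an
    instance generator with a programmatic verifier. All names and
    the toy task are illustrative, not the paper's actual API."""

    def generate(self, difficulty, seed=None):
        # Sample a fresh problem instance; difficulty scales list length.
        rng = random.Random(seed)
        values = [rng.randint(0, 99) for _ in range(3 + difficulty)]
        prompt = ("Sort these numbers in ascending order and reply with "
                  f"the comma-separated result: {values}")
        return {"prompt": prompt, "values": values}

    def verify(self, instance, response):
        # Binary verifiable reward: 1.0 iff the model's answer matches
        # the ground-truth sorted order, else 0.0.
        try:
            answer = [int(tok) for tok in response.split(",")]
        except ValueError:
            return 0.0
        return float(answer == sorted(instance["values"]))

# Usage: generate an instance, then score a (here, perfect) response.
env = SortingPuzzleEnv()
inst = env.generate(difficulty=2, seed=0)
gold = ",".join(map(str, sorted(inst["values"])))
print(env.verify(inst, gold))  # prints 1.0
```

Because the verifier checks the answer rather than the reasoning trace, one such environment definition can supply an unbounded stream of RL training instances at controllable difficulty.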