Reinforcement learning (RL) offers a principled way to enhance the reasoning capabilities of large language models, yet its effectiveness hinges on training signals that remain informative as models evolve. In practice, RL progress often slows when task difficulty becomes poorly aligned with model capability, or when training is dominated by a narrow set of recurring problem patterns. To jointly address these issues, we propose SCALER (Synthetic sCalable Adaptive Learning Environment for Reasoning), a framework that sustains effective learning signals through adaptive environment design. SCALER introduces a scalable synthesis pipeline that converts real-world programming problems into verifiable reasoning environments with controllable difficulty and unbounded instance generation, enabling RL training beyond finite datasets while preserving strong correctness guarantees. Building on this, SCALER further employs an adaptive multi-environment RL strategy that dynamically adjusts instance difficulty and curates the active set of environments to track the model's capability frontier and maintain distributional diversity. This co-adaptation prevents reward sparsity, mitigates overfitting to narrow task patterns, and supports sustained improvement throughout training. Extensive experiments show that SCALER consistently outperforms dataset-based RL baselines across diverse reasoning benchmarks and exhibits more stable, long-horizon training dynamics.
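As a rough illustration of the co-adaptation idea described in the abstract, the sketch below pairs a toy verifiable environment with a success-rate-based difficulty controller and a simple curation rule. It is not the SCALER implementation: the `SumEnv` class, the `adapt_difficulty` and `curate` helpers, the 0.3/0.7 thresholds, and the window size are all hypothetical choices made for this example.

```python
import random
from dataclasses import dataclass, field


@dataclass
class SumEnv:
    """Hypothetical toy environment: add up `difficulty` random digits."""
    difficulty: int = 2
    history: list = field(default_factory=list)

    def sample(self, rng):
        # Unbounded instance generation: each call yields a fresh problem
        # together with a programmatically verifiable answer.
        nums = [rng.randint(1, 9) for _ in range(self.difficulty)]
        return nums, sum(nums)

    def record(self, correct, window=20):
        # Keep only the most recent outcomes (assumed window size).
        self.history.append(bool(correct))
        self.history = self.history[-window:]

    def success_rate(self):
        # With no data yet, assume a neutral 0.5.
        return sum(self.history) / len(self.history) if self.history else 0.5


def adapt_difficulty(env, low=0.3, high=0.7, min_d=1, max_d=64):
    """Raise difficulty when the model succeeds too often, lower it when it
    fails too often. Thresholds are illustrative assumptions."""
    rate = env.success_rate()
    if rate > high:
        env.difficulty = min(max_d, env.difficulty + 1)
        env.history.clear()  # old outcomes no longer reflect the new level
    elif rate < low:
        env.difficulty = max(min_d, env.difficulty - 1)
        env.history.clear()
    return env.difficulty


def curate(envs, active_size):
    """Keep the environments whose success rate is closest to 0.5, i.e. those
    assumed to carry the most informative reward signal."""
    ranked = sorted(envs, key=lambda e: abs(e.success_rate() - 0.5))
    return ranked[:active_size]
```

In a training loop, one would call `record` after each verified rollout, `adapt_difficulty` periodically per environment, and `curate` whenever the active set is refreshed. A real system would also need a diversity criterion for curation, which this sketch omits.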