While scaling laws for large language models (LLMs) have been studied extensively for pre-training, their behavior under reinforcement learning (RL) post-training remains largely unexplored. This paper presents a systematic empirical investigation of scaling behavior in RL-based post-training, with a particular focus on mathematical reasoning. Through experiments across the full Qwen2.5 dense model series (0.5B to 72B), we characterize how model scale, data volume, and computational budget interact to shape performance. Our analysis leads to four key findings: (1) Larger models consistently exhibit superior learning efficiency with respect to both compute and data. (2) The relationship between test loss, compute, and data can be modeled by a predictive power law that is robust across both base and instruction-tuned models. (3) Although larger models learn more efficiently, the analytical learning efficiency term k(N) in the power law reveals a latent saturation trend as model size continues to increase. (4) In data-constrained regimes, repeated reuse of high-quality data proves highly effective, as final performance is governed primarily by the total number of optimization steps rather than by the uniqueness of samples. Collectively, these results provide a principled foundation and practical guidelines for efficiently scaling the reasoning capabilities of LLMs through RL post-training.
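The abstract does not state the functional form of the power law. As a minimal sketch, assume that for a fixed model size N the test loss follows L = A * C^(-k(N)) in compute C; the exponent k(N) can then be estimated with a straight-line fit in log-log space. The function name, the constant A, the model labels, and the synthetic values below are illustrative assumptions, not the paper's method or data.

```python
import numpy as np


def fit_power_law(compute, loss):
    """Fit loss = A * compute**(-k) by least squares in log-log space.

    Returns (k, A). The functional form is an assumption made for this sketch.
    """
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
    return -slope, float(np.exp(intercept))


# Synthetic, noise-free curves for illustration only (not results from the paper).
compute = np.logspace(0, 3, 20)
true_k = {"small": 0.05, "large": 0.12}  # hypothetical efficiency exponents

fitted = {}
for name, k in true_k.items():
    loss = 0.8 * compute ** (-k)  # hypothetical prefactor A = 0.8
    fitted[name], _ = fit_power_law(compute, loss)
```

Under this assumed form, a larger fitted exponent means loss falls faster per unit of compute, so comparing the fitted values across model sizes is one way to check whether the efficiency gain flattens as N grows.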