Large Language Models (LLMs) have achieved remarkable success in natural language processing tasks, with Reinforcement Learning (RL) playing a key role in adapting them to specific applications. In mathematical problem solving, however, the reliance on ground-truth answers poses significant challenges due to their high collection cost and limited availability. This work explores the use of two simple surrogate signals, format and length, to guide RL training. We find that early training is dominated by format learning, where structural feedback alone accounts for most performance gains. Incorporating length-based rewards further refines outputs by discouraging overly long or short responses, enabling a Group Relative Policy Optimization (GRPO) approach driven by format-length signals to approximate, and in some cases surpass, ground-truth-based optimization. For example, our method achieves 40.0% accuracy on AIME2024 with a 7B base model, and it generalizes across different model sizes and series. Beyond practical efficiency, these findings offer a thought-provoking perspective on RL: rather than imparting new knowledge, RL primarily activates reasoning capabilities already embedded in pre-trained models. This insight suggests that lightweight, label-efficient strategies can complement pre-training to unlock LLMs' latent potential in reasoning-intensive tasks.
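To make the surrogate-reward idea concrete, the following is a minimal sketch of how a combined format-length reward could be computed without any ground-truth answer. The template regex, length band, and weights are illustrative assumptions, not the paper's exact formulation.

```python
import re

# Hypothetical surrogate reward combining a format check with a length-shaping
# term. The tag names, length band, and weights are assumptions for
# illustration only.
FORMAT_PATTERN = re.compile(r"<think>.*</think>.*\\boxed\{.*\}", re.DOTALL)

def format_length_reward(response: str,
                         min_len: int = 200,
                         max_len: int = 4000,
                         w_format: float = 1.0,
                         w_length: float = 0.5) -> float:
    """Scalar reward built only from structural signals.

    No reference answer is consulted: the reward is highest when the response
    follows the expected template and its length falls inside a target band,
    and it decays for overly short or overly long outputs.
    """
    # Format term: 1 if the response matches the expected template, else 0.
    r_format = 1.0 if FORMAT_PATTERN.search(response) else 0.0

    # Length term: 1 inside [min_len, max_len], linearly decaying outside.
    n = len(response)
    if n < min_len:
        r_length = n / min_len
    elif n > max_len:
        r_length = max(0.0, 1.0 - (n - max_len) / max_len)
    else:
        r_length = 1.0

    return w_format * r_format + w_length * r_length
```

In a GRPO-style setup, such a reward would simply replace the answer-matching reward when scoring each sampled response in a group, with advantages computed relative to the group mean as usual.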