Recent supervised fine-tuning (SFT) approaches have significantly improved language models' performance on mathematical reasoning tasks, even when models are trained at a small scale. However, the specific capabilities enhanced through such fine-tuning remain poorly understood. In this paper, we conduct a detailed analysis of model performance on the AIME24 dataset to understand how reasoning capabilities evolve. We discover a ladder-like structure in problem difficulty, categorize questions into four tiers (Easy, Medium, Hard, and Extremely Hard (Exh)), and identify the specific requirements for advancing between tiers. We find that progression from the Easy to the Medium tier requires adopting an R1 reasoning style with minimal SFT (500-1K instances), while Hard-level questions suffer from frequent model errors at each step of the reasoning chain, with accuracy plateauing at around 65% despite logarithmic scaling of the dataset. Exh-level questions present a fundamentally different challenge: they require unconventional problem-solving skills that current models uniformly struggle with. Additional findings reveal that carefully curated small-scale datasets offer limited advantage; scaling dataset size proves far more effective. Our analysis provides a clearer roadmap for advancing language model capabilities in mathematical reasoning.