This paper presents a critical examination of current approaches to replicating OpenAI's O1 model capabilities, with particular focus on the widespread but often undisclosed use of knowledge distillation techniques. While our previous work explored the fundamental technical path to O1 replication, this study reveals how simple distillation from O1's API, combined with supervised fine-tuning, can achieve superior performance on complex mathematical reasoning tasks. Through extensive experiments, we show that a base model fine-tuned on merely tens of thousands of samples of O1-distilled long-thought chains outperforms O1-preview on the American Invitational Mathematics Examination (AIME) with minimal technical complexity. Moreover, our investigation extends beyond mathematical reasoning to explore the generalization capabilities of O1-distilled models across diverse tasks: hallucination, safety, and open-domain QA. Notably, despite training only on mathematical problem-solving data, our models demonstrated strong generalization to open-ended QA tasks and became significantly less susceptible to sycophancy after fine-tuning. We deliberately make this finding public to promote transparency in AI research and to challenge the current trend of obscured technical claims in the field. Our work includes: (1) a detailed technical exposition of the distillation process and its effectiveness; (2) a comprehensive benchmark framework for evaluating and categorizing O1 replication attempts based on their technical transparency and reproducibility; and (3) a critical discussion of the limitations and potential risks of over-relying on distillation approaches. Our analysis culminates in a crucial bitter lesson: while the pursuit of more capable AI systems is important, the development of researchers grounded in first-principles thinking is paramount.
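To make the distill-then-fine-tune recipe concrete, the sketch below shows one plausible way to package a teacher-sampled long-thought chain into a supervised fine-tuning record. This is an illustrative assumption only: the field names, the `<thought>` delimiter, and the record layout are hypothetical and are not claimed to match the paper's actual data format.

```python
# Hypothetical sketch of the distillation-to-SFT data step: a long-thought
# chain sampled from a teacher model's API is wrapped, together with the
# final answer, into a prompt/completion pair for supervised fine-tuning.
# All names and tags here are illustrative, not the paper's actual schema.

def build_sft_example(problem: str, long_thought: str, answer: str) -> dict:
    """Wrap one distilled long-thought chain into an SFT training record."""
    completion = (
        "<thought>\n" + long_thought.strip() + "\n</thought>\n"
        "Final answer: " + answer.strip()
    )
    return {"prompt": problem.strip(), "completion": completion}

# Example record built from a distilled reasoning trace.
sample = build_sft_example(
    problem="Find the remainder when 2^10 is divided by 7.",
    long_thought="2^3 = 8 is 1 mod 7, so 2^10 = (2^3)^3 * 2 is 1 * 2 = 2 mod 7.",
    answer="2",
)
```

Collecting tens of thousands of such records and running standard supervised fine-tuning on a base model is the entire technical core the abstract describes, which is precisely why the authors stress its low complexity.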