Recent advances in large language models (LLMs) have demonstrated remarkable reasoning capabilities through long chain-of-thought (CoT) reasoning. The R1 distillation scheme has emerged as a promising approach for training cost-effective models with enhanced reasoning abilities. However, the underlying mechanisms driving its effectiveness remain unclear. This study examines the universality of distillation data and identifies the key components that enable efficient transfer of long-chain reasoning capabilities in LLM distillation. Our findings reveal that the effectiveness of long CoT reasoning distillation from teacher models such as Qwen-QwQ degrades significantly on non-homologous models (i.e., models outside the teacher's family), challenging the assumed universality of current distillation methods. To gain deeper insight into the structure and patterns of long CoT reasoning, we propose DLCoT (Deconstructing Long Chain-of-Thought), a distillation data enhancement framework. DLCoT consists of three key steps: (1) data segmentation to decompose complex long CoT structures, (2) simplification by eliminating unsolvable and redundant solution attempts, and (3) optimization of intermediate error states. Our approach significantly improves model performance and token efficiency, facilitating the development of high-performance LLMs.
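Since the abstract describes DLCoT only as a three-stage pipeline (segmentation, simplification, optimization of intermediate error states), the following is a minimal illustrative sketch of such a pipeline. Every name and heuristic in it (`CoTSample`, `segment_cot`, `prune_solutions`, `repair_intermediate_errors`, `enhance`, and the string-matching filters) is a hypothetical assumption for illustration, not the paper's actual implementation or API.

```python
# Hypothetical sketch of a DLCoT-style data-enhancement pipeline.
# All function names, fields, and heuristics are illustrative assumptions;
# the paper does not specify an implementation in the abstract.
from dataclasses import dataclass, field
from typing import List


@dataclass
class CoTSample:
    question: str
    segments: List[str] = field(default_factory=list)  # decomposed reasoning spans


def segment_cot(raw_cot: str) -> List[str]:
    """Step 1: split a long chain-of-thought into coarse solution attempts."""
    # Naive placeholder heuristic: treat blank-line-separated blocks as segments.
    return [block.strip() for block in raw_cot.split("\n\n") if block.strip()]


def prune_solutions(segments: List[str]) -> List[str]:
    """Step 2: drop unsolvable or redundant solution attempts (placeholder heuristics)."""
    seen = set()
    kept = []
    for seg in segments:
        if "give up" in seg.lower():  # stand-in for an unsolvable-attempt detector
            continue
        if seg in seen:  # stand-in for redundancy detection
            continue
        seen.add(seg)
        kept.append(seg)
    return kept


def repair_intermediate_errors(segments: List[str]) -> List[str]:
    """Step 3: optimize intermediate error states (placeholder filter on flagged spans)."""
    return [seg for seg in segments if "error" not in seg.lower()]


def enhance(question: str, raw_cot: str) -> CoTSample:
    """Run the three-stage enhancement on one distillation example."""
    segments = segment_cot(raw_cot)
    segments = prune_solutions(segments)
    segments = repair_intermediate_errors(segments)
    return CoTSample(question=question, segments=segments)
```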