"The Whole Is Greater Than the Sum of Its Parts": A Compatibility-Aware Multi-Teacher CoT Distillation Framework

Chain-of-Thought (CoT) reasoning empowers Large Language Models (LLMs) with remarkable capabilities but typically requires prohibitive parameter scales. CoT distillation has emerged as a promising paradigm for transferring reasoning prowess into compact Student Models (SLMs), but existing approaches usually rely on a single teacher. This caps the student's potential, since individual LLMs exhibit distinct capability biases and may suffer from catastrophic forgetting. While leveraging diverse teachers seems appealing, effectively fusing their supervision remains challenging: teacher-student incompatibility risks amplifying hallucinations, and passive supervision fails to ensure genuine logic internalization. To address these challenges, we introduce COMPACT, a framework that adaptively fuses supervision from different teachers by dynamically weighting teacher gradients according to the student's real-time compatibility. Compatibility is evaluated by a multi-dimensional metric: (1) Graph-based Consensus, which filters misleading rationales by identifying mainstream reasoning paths; (2) Mutual-Information-based Adaptability, which detects "epiphany moments" where the student genuinely understands the reasoning process rather than merely imitating it; and (3) Loss-based Difficulty, which assesses student receptivity to the teacher's guidance and prevents negative transfer. Extensive experiments and latent space analysis demonstrate that COMPACT effectively integrates diverse reasoning capabilities without damaging the model's original knowledge structure, achieving state-of-the-art performance on various benchmarks while mitigating catastrophic forgetting.
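The abstract does not give formulas for the three compatibility scores or for how they are combined, so the following is only a minimal sketch of the general idea of compatibility-weighted fusion, not the COMPACT method itself. It assumes each teacher already has a scalar consensus, adaptability, and difficulty score, that the scores are combined as `consensus + adaptability - difficulty`, and that weights come from a softmax over teachers. The function names `compatibility_weights` and `fused_loss`, the combination rule, the `temperature` parameter, and all numbers are hypothetical. Because gradients are linear, weighting per-teacher losses with fixed weights has the same effect as weighting the per-teacher gradients.

```python
import math


def compatibility_weights(consensus, adaptability, difficulty, temperature=1.0):
    """Map per-teacher scores to weights that sum to 1.

    Hypothetical rule (not from the paper): reward consensus and
    adaptability, penalize difficulty, then apply a softmax over teachers.
    """
    if not (len(consensus) == len(adaptability) == len(difficulty)):
        raise ValueError("each score list needs one entry per teacher")
    scores = [c + a - d for c, a, d in zip(consensus, adaptability, difficulty)]
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fused_loss(teacher_losses, weights):
    """Weighted sum of per-teacher distillation losses.

    With the weights held constant, the gradient of this sum equals the
    weighted sum of the per-teacher gradients.
    """
    return sum(w * loss for w, loss in zip(weights, teacher_losses))


# Illustrative values for three teachers (made up for this sketch).
weights = compatibility_weights(
    consensus=[0.9, 0.4, 0.7],
    adaptability=[0.6, 0.2, 0.5],
    difficulty=[0.3, 0.8, 0.3],
)
teacher_losses = [1.2, 2.5, 1.4]
loss = fused_loss(teacher_losses, weights)
```

In this sketch the second teacher, which has low consensus and high difficulty, receives the smallest weight, so its rationale contributes least to the student's update. A lower `temperature` would concentrate the weight on the most compatible teacher, and a higher one would move the weights toward a uniform average.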
