Knowledge distillation, the technique of transferring knowledge from large, complex models to smaller ones, is a pivotal step towards efficient AI deployment. Distilling Step-by-Step (DSS), a recent method based on chain-of-thought (CoT) distillation, has shown promise by imbuing smaller models with the superior reasoning capabilities of their larger counterparts. In DSS, the distilled model learns to generate rationales and predict labels concurrently through a multi-task learning framework. However, DSS overlooks the intrinsic relationship between the two training tasks, leading to ineffective integration of CoT knowledge with the task of label prediction. To address this, we investigate the mutual relationship between the two tasks from an Information Bottleneck perspective and formulate it as maximizing the mutual information between the representation features of the two tasks. We propose a learning-based variational approach to solve this optimization problem. Our experimental results across four datasets demonstrate that our method outperforms the state-of-the-art DSS. Our findings offer insightful guidance for future research on language model distillation as well as applications involving CoT. Code and models will be released soon.
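As a rough illustration of the formulation described above, one possible way to write such an objective is sketched below. The abstract does not give the exact form, so every symbol here is an assumption: $Z_p$ and $Z_r$ stand for the representation features of the label-prediction and rationale-generation tasks, $q_\phi$ is a hypothetical learned variational distribution, and $\lambda$ and $\beta$ are hypothetical weighting coefficients. The inequality is the standard variational lower bound on mutual information, which follows from the non-negativity of the KL divergence; it is not necessarily the bound used in this work.

```latex
% Illustrative sketch only; the notation is assumed, not taken from the paper.
\begin{align}
  % Multi-task objective: label prediction plus rationale generation
  \mathcal{L}_{\mathrm{DSS}}
    &= \mathcal{L}_{\mathrm{label}} + \lambda\, \mathcal{L}_{\mathrm{rationale}}, \\
  % Mutual information between the two tasks' representation features
  I(Z_p; Z_r)
    &= H(Z_p) - H(Z_p \mid Z_r) \\
  % Replacing the true conditional with q_phi can only lower the value
    &\ge H(Z_p) + \mathbb{E}_{p(z_p, z_r)}\!\left[ \log q_\phi(z_p \mid z_r) \right], \\
  % One possible combined objective: maximize the bound alongside the task losses
  \mathcal{L}
    &= \mathcal{L}_{\mathrm{DSS}}
       - \beta\, \mathbb{E}_{p(z_p, z_r)}\!\left[ \log q_\phi(z_p \mid z_r) \right].
\end{align}
```

In this sketch the entropy term $H(Z_p)$ is dropped from the final objective and treated as fixed, which is a common simplification rather than an exact step, since $H(Z_p)$ generally depends on the encoder parameters.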