Knowledge distillation compresses a larger neural model (the teacher) into a smaller, faster student by training the student to match the teacher's outputs. However, the internal computational transformations that occur during this process remain poorly understood. We apply techniques from mechanistic interpretability to analyze how internal circuits, representations, and activation patterns differ between teachers and students. Focusing on GPT2 and its distilled counterpart DistilGPT2, and generalizing our findings to both bidirectional architectures and larger model pairs, we find that student models can reorganize, compress, and discard teacher components, often coming to rely more heavily on a smaller number of individual components. To quantify functional alignment beyond output similarity, we introduce an alignment metric based on influence-weighted component similarity, validated across multiple tasks. Our findings reveal that while knowledge distillation preserves broad functional behaviors, it also causes significant shifts in internal computation, with important implications for the robustness and generalization capacity of distilled models.
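The influence-weighted alignment metric mentioned above could take a form like the following minimal sketch. This is an illustration under stated assumptions, not the paper's actual implementation: the function name, the choice of cosine similarity between matched component activations, and the normalization of influence scores (e.g., from ablation effects) are all assumptions for the sake of the example.

```python
import numpy as np

def influence_weighted_alignment(teacher_acts, student_acts, influences):
    """Hypothetical alignment score between matched teacher/student components.

    teacher_acts, student_acts: lists of 1-D activation vectors, one per
        matched component (e.g., attention head or MLP), same order.
    influences: per-component influence scores (e.g., ablation effect sizes),
        used to weight components that matter more to the teacher's behavior.
    """
    # Per-component cosine similarity between teacher and student activations.
    sims = np.array([
        np.dot(t, s) / (np.linalg.norm(t) * np.linalg.norm(s) + 1e-8)
        for t, s in zip(teacher_acts, student_acts)
    ])
    # Normalize influence scores into weights that sum to 1.
    w = np.asarray(influences, dtype=float)
    w = w / w.sum()
    # Influence-weighted average similarity: components with higher influence
    # contribute more to the overall alignment score.
    return float(np.dot(w, sims))
```

On this sketch, a perfectly matched pair of models scores near 1.0, while a student that diverges on high-influence components is penalized more than one that diverges on low-influence components.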