Model distillation typically focuses on behavioral mimicry, where a student model is trained to replicate a teacher's outputs while treating its internal computations as a black box. In this work we propose an alternative approach: distilling the underlying computational mechanisms implemented by a teacher model. Specifically, we propose circuit distillation, which introduces an objective to align internal representations between analogous circuit components in teacher and student models. We propose a method to match ``functionally corresponding'' circuit components and introduce a loss that reflects the similarity between the representations these components induce. We evaluate circuit distillation on entity tracking and theory of mind (ToM) tasks using models from the Llama3 family. Our results demonstrate that circuit distillation outperforms standard distillation, successfully transferring algorithmic capabilities by adjusting only a small, targeted subset of student model parameters. This work establishes the feasibility of transferring mechanisms, which may in turn allow for efficient distillation of targeted teacher capabilities via interpretable and controllable internal student mechanisms.
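To make the alignment objective concrete, below is a minimal PyTorch sketch of what such a loss could look like, assuming circuit components have already been matched and their activations cached as tensors. The component identifiers, the learned linear projection across model widths, and the cosine-similarity form of the loss are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CircuitDistillLoss(nn.Module):
    """Hypothetical circuit-distillation objective: align activations of
    matched (functionally corresponding) components in teacher and student."""

    def __init__(self, matching, d_student, d_teacher):
        # matching: list of (student_component_id, teacher_component_id) pairs
        super().__init__()
        self.matching = matching
        # learned projection lets us compare representations of different widths
        self.proj = nn.Linear(d_student, d_teacher, bias=False)

    def forward(self, student_acts, teacher_acts):
        # student_acts / teacher_acts: dicts mapping a component id to an
        # activation tensor of shape (batch, seq_len, d_model)
        losses = []
        for s_id, t_id in self.matching:
            s = self.proj(student_acts[s_id])
            t = teacher_acts[t_id].detach()  # teacher is frozen
            # 1 - cosine similarity, averaged over batch and positions
            losses.append((1 - F.cosine_similarity(s, t, dim=-1)).mean())
        return torch.stack(losses).mean()

# Toy usage with random activations and hypothetical component names.
matching = [("student_head_3_5", "teacher_head_11_2")]
loss_fn = CircuitDistillLoss(matching, d_student=512, d_teacher=1024)
s_acts = {"student_head_3_5": torch.randn(2, 8, 512)}
t_acts = {"teacher_head_11_2": torch.randn(2, 8, 1024)}
loss = loss_fn(s_acts, t_acts)
loss.backward()
```

In a training loop, a term like this would be added to the standard distillation loss and, consistent with the abstract's claim, gradients would be restricted to the small subset of student parameters implementing the matched components.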