Fine-tuning on generalized tasks such as instruction following, code generation, and mathematics has been shown to enhance language models' performance on a range of tasks. Nevertheless, explanations of how such fine-tuning influences the internal computations in these models remain elusive. We study how fine-tuning affects the internal mechanisms implemented in language models. As a case study, we explore the property of entity tracking, a crucial facet of language comprehension, where models fine-tuned on mathematics show substantial performance gains. We identify the mechanism that enables entity tracking and show that (i) primarily the same circuit implements entity tracking in both the original model and its fine-tuned versions. In fact, the original model's entity tracking circuit, when evaluated on the fine-tuned versions, performs better than the full original model. (ii) The circuits of all the models implement roughly the same functionality: entity tracking is performed by tracking the position of the correct entity in both the original model and its fine-tuned versions. (iii) The performance boost in the fine-tuned models is primarily attributed to their improved ability to handle the augmented positional information. To uncover these findings, we employ Path Patching; DCM, which automatically detects model components responsible for specific semantics; and CMAP, a new approach for patching activations across models to reveal improved mechanisms. Our findings suggest that fine-tuning enhances, rather than fundamentally alters, the mechanistic operation of the model.
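The abstract describes CMAP only as patching activations across models. As a rough illustration of that general idea, and not the authors' implementation, the sketch below copies one layer's activation from a hypothetical "fine-tuned" toy network into the corresponding layer of a "base" network and reads off the resulting output. The toy MLP, the names `forward`, `base`, `tuned`, and `patched_output`, and the random perturbation standing in for fine-tuning are all assumptions made for this example.

```python
import numpy as np


def forward(weights, x, patch=None):
    """Run a toy tanh MLP and return (output, per-layer activations).

    `patch` maps a layer index to an activation that overwrites that
    layer's output before it is passed to the next layer.
    """
    patch = patch or {}
    acts = []
    h = x
    for i, w in enumerate(weights):
        h = np.tanh(h @ w)
        if i in patch:
            h = patch[i]  # overwrite with the donor model's activation
        acts.append(h)
    return h, acts


rng = np.random.default_rng(0)

# Hypothetical "base" model: three 4x4 layers with random weights.
base = [rng.normal(size=(4, 4)) for _ in range(3)]
# Hypothetical "fine-tuned" model: same architecture, perturbed weights.
tuned = [w + 0.1 * rng.normal(size=w.shape) for w in base]

x = rng.normal(size=(1, 4))

base_out, _ = forward(base, x)
tuned_out, tuned_acts = forward(tuned, x)


def patched_output(layer):
    """Run the base model with one layer's activation taken from the tuned model."""
    out, _ = forward(base, x, patch={layer: tuned_acts[layer]})
    return out


for layer in range(len(base)):
    # Distance of the patched base output from the tuned output.
    gap = float(np.abs(patched_output(layer) - tuned_out).max())
    print(f"patch layer {layer}: max |patched - tuned| = {gap:.4f}")
```

In this toy setting, patching the final layer reproduces the tuned model's output exactly, while patching an earlier layer leaves the remaining base layers to process the donor activation. In a real study the patched components would be specific circuit elements, and the comparison would use task accuracy rather than raw output distance.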