Attribution-Guided Pruning for Insight and Control: Circuit Discovery and Targeted Correction in Small-scale LLMs

Sayed Mohammad Vakilzadeh Hatefi, Maximilian Dreyer, Reduan Achtibat, Patrick Kahardipraja, Thomas Wiegand, Wojciech Samek, Alexander Binder, Sebastian Lapuschkin

From arXiv. Work in progress (9-page manuscript, 3 pages of references, 16-page appendix).

Large Language Models (LLMs) are widely deployed in real-world applications, yet their internal mechanisms remain difficult to interpret and control, limiting our ability to diagnose and correct undesirable behaviors. Mechanistic interpretability addresses this challenge by identifying circuits -- subsets of model components responsible for specific behaviors. However, discovering such circuits in LLMs remains difficult due to their scale and complexity. We propose an attribution-guided pruning approach for circuit discovery based on Layer-wise Relevance Propagation (LRP). By attributing model outputs to internal components using task-specific reference samples, we identify behaviorally relevant parameters and extract sparse functional circuits. Building on this, we introduce contrastive relevance to isolate circuits associated with undesired behaviors while preserving general capabilities, enabling targeted model correction. On OPT-125M, removing only 100 neurons (0.3%) significantly reduces toxic outputs, while pruning approximately 0.03% of weight elements mitigates repetitive text generation without degrading general performance. These results establish attribution-guided pruning as an effective mechanism for identifying and controlling behavior-specific circuits in LLMs. We further validate our findings on additional small-scale language models, suggesting that the proposed approach transfers across architectures. Our code is publicly available at https://github.com/erfanhatefi/SparC3.
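The selection step described in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation: the relevance values below are made-up numbers standing in for LRP attributions, the contrastive score is assumed here to be a simple difference of mean relevances (the abstract does not state the exact formula), and all function names are hypothetical.

```python
from typing import List


def mean_relevance(per_sample: List[List[float]]) -> List[float]:
    """Average per-neuron relevance over a set of reference samples."""
    n = len(per_sample)
    return [sum(column) / n for column in zip(*per_sample)]


def contrastive_relevance(undesired: List[float], general: List[float]) -> List[float]:
    """ASSUMPTION: contrast = relevance on undesired-behavior samples
    minus relevance on general samples. The paper may define it differently."""
    return [u - g for u, g in zip(undesired, general)]


def select_neurons_to_prune(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring neurons, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def prune(weights: List[List[float]], neuron_ids: List[int]) -> List[List[float]]:
    """Return a copy of `weights` (one row per neuron) with the selected rows zeroed."""
    drop = set(neuron_ids)
    return [[0.0] * len(row) if i in drop else list(row)
            for i, row in enumerate(weights)]


# Toy relevance scores for 5 neurons (rows = reference samples).
# In the paper these would come from LRP; here they are invented.
undesired_samples = [[0.9, 0.1, 0.5, 0.0, 0.2],
                     [0.7, 0.3, 0.5, 0.2, 0.0]]
general_samples = [[0.1, 0.2, 0.6, 0.1, 0.0],
                   [0.1, 0.0, 0.4, 0.1, 0.2]]

undesired_mean = mean_relevance(undesired_samples)
general_mean = mean_relevance(general_samples)

# Ranking by undesired-behavior relevance alone also picks neuron 2,
# which is equally relevant for general behavior.
plain_choice = select_neurons_to_prune(undesired_mean, k=2)

# The contrastive score favors neurons that matter mainly for the undesired behavior.
contrast = contrastive_relevance(undesired_mean, general_mean)
contrastive_choice = select_neurons_to_prune(contrast, k=2)

# Toy weight matrix: one row of outgoing weights per neuron.
weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 1.0]]
pruned_weights = prune(weights, contrastive_choice)
```

In this toy setup, neuron 2 scores highly on both sample sets, so the contrastive ranking leaves it in place while the plain ranking would remove it. That is the intuition behind isolating behavior-specific components while preserving general capabilities.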
