Circuit discovery aims to identify minimal subnetworks responsible for specific behaviors in large language models (LLMs). Existing approaches primarily rely on iterative edge pruning, which is computationally expensive and limited to coarse-grained units such as attention heads or MLP blocks, overlooking finer structures such as individual neurons. We propose a node-level pruning framework for circuit discovery that addresses both the scalability and the granularity limitations. Our method introduces learnable masks at multiple levels of granularity, from entire blocks to individual neurons, within a unified optimization objective. Granularity-specific sparsity penalties guide the pruning process, enabling comprehensive compression in a single fine-tuning run. Empirically, our approach identifies circuits that contain fewer nodes than those discovered by prior methods while maintaining task performance; moreover, we show that many neurons deemed important by coarse-grained methods are in fact irrelevant. Furthermore, our method reduces the memory footprint by 5-10x, as it does not need to keep intermediate activations in memory.
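To make the idea of multi-granularity masks concrete, the following is a minimal sketch and not the authors' implementation. It assumes sigmoid-gated masks, a product rule that combines a block-level gate with its neuron-level gates, an L1-style penalty with a separate weight per granularity, and a fixed threshold for reading off the final circuit. The abstract specifies none of these choices, and all function and parameter names (`effective_masks`, `sparsity_penalty`, `extract_circuit`, `lam_block`, `lam_neuron`) are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Map a real-valued mask logit to a soft gate in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def effective_masks(block_logits, neuron_logits):
    """Combine coarse and fine gates (assumed product rule).

    A neuron's effective gate is its own gate times the gate of the
    block that contains it, so closing a block closes all its neurons.
    """
    masks = []
    for b_logit, n_logits in zip(block_logits, neuron_logits):
        b_gate = sigmoid(b_logit)
        masks.append([b_gate * sigmoid(n) for n in n_logits])
    return masks


def sparsity_penalty(block_logits, neuron_logits, lam_block, lam_neuron):
    """Granularity-specific penalty: one weight per level (assumed L1-style).

    In training, this term would be added to the task loss.
    """
    block_term = sum(sigmoid(b) for b in block_logits)
    neuron_term = sum(sigmoid(n) for row in neuron_logits for n in row)
    return lam_block * block_term + lam_neuron * neuron_term


def extract_circuit(block_logits, neuron_logits, threshold=0.5):
    """Return (block, neuron) indices whose block and neuron gates both exceed the threshold."""
    kept = []
    for i, (b_logit, n_logits) in enumerate(zip(block_logits, neuron_logits)):
        if sigmoid(b_logit) <= threshold:
            continue  # the whole block is pruned
        for j, n_logit in enumerate(n_logits):
            if sigmoid(n_logit) > threshold:
                kept.append((i, j))
    return kept


# Toy example: two blocks with two neurons each (illustrative values only).
block_logits = [4.0, -4.0]                  # block 1 is mostly closed
neuron_logits = [[3.0, -3.0], [3.0, 3.0]]   # neuron (0, 1) is mostly closed
masks = effective_masks(block_logits, neuron_logits)
penalty = sparsity_penalty(block_logits, neuron_logits, 1.0, 1.0)
circuit = extract_circuit(block_logits, neuron_logits)
```

In this toy setting, neurons in block 1 are removed even though their own gates are open, because the enclosing block gate is closed. This illustrates how a coarse-level mask and a fine-level mask can interact within a single objective, under the assumptions stated above.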