Causal tracing systematically intervenes on the internal representations of a large language model (LLM) to uncover and quantify the causal pathways that link specific inputs or computations to metrics of interest that characterize the model's behavior. Building on prior studies that trace a single component or a single layer, this paper presents a unified framework for causally tracing multiple components simultaneously. The framework systematically identifies the subsets of components (e.g., attention heads and multi-layer perceptron neurons) that are most critical to a target performance metric (e.g., accuracy or fairness), and it supports flexible interventions and a wide range of metrics. To address the combinatorial complexity of the multi-component problem, we design an efficient algorithm that combines soft interventions with a carefully designed metric transformation. This converts the combinatorial search into a continuous optimization problem that can be solved efficiently under suitable constraints and whose solution yields binary decisions for selecting components. Experimental results demonstrate that the proposed method efficiently identifies subsets of the model's components that have a high impact on the target metric, outperforming existing baseline approaches. Our code is available at https://github.com/ZiruiYan/multi-component-causal-tracing.
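The idea of relaxing component selection into a continuous problem can be illustrated with a small toy sketch. It is not the paper's algorithm: the abstract does not specify the metric transformation or the constraints, so everything below is an assumption made for illustration. The "model" is a hypothetical weighted sum of five component outputs (`WEIGHTS`, `toy_metric`), a soft intervention blends each component's clean and corrupted output with a mask value in [0, 1], the budget of `k` components is enforced with a simple quadratic penalty, and the final binary decision is taken by keeping the `k` largest mask logits.

```python
import math

# Hypothetical per-component importance; a real LLM metric would replace this.
WEIGHTS = [0.1, 2.0, 0.3, 1.5, 0.05]


def soft_intervene(clean, corrupted, mask):
    """Blend clean and corrupted outputs; mask[i] = 0 keeps clean, 1 fully corrupts."""
    return [(1 - m) * c + m * x for c, x, m in zip(clean, corrupted, mask)]


def toy_metric(outputs):
    """Stand-in for a target metric: a weighted sum of component outputs."""
    return sum(w * o for w, o in zip(WEIGHTS, outputs))


def impact(mask, clean, corrupted):
    """Drop in the metric caused by intervening with the given (soft) mask."""
    return toy_metric(clean) - toy_metric(soft_intervene(clean, corrupted, mask))


def _sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def select_components(clean, corrupted, k, steps=200, lr=0.5, penalty=1.0, eps=1e-4):
    """Relax the binary mask to sigmoid(theta), ascend the penalized impact, keep top-k."""
    n = len(clean)
    theta = [0.0] * n  # mask logits, all masks start at 0.5

    def objective(th):
        mask = [_sigmoid(t) for t in th]
        excess = max(0.0, sum(mask) - k)  # soft budget: at most k components
        return impact(mask, clean, corrupted) - penalty * excess ** 2

    for _ in range(steps):
        grad = []
        for i in range(n):
            # Central finite difference keeps the sketch dependency-free.
            up, down = theta[:], theta[:]
            up[i] += eps
            down[i] -= eps
            grad.append((objective(up) - objective(down)) / (2 * eps))
        theta = [t + lr * g for t, g in zip(theta, grad)]

    # Binarize: the k largest logits become the selected components.
    ranked = sorted(range(n), key=lambda i: theta[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    chosen = select_components(clean=[1.0] * 5, corrupted=[0.0] * 5, k=2)
    print("selected components:", chosen)
```

With these toy weights the search settles on the two components with the largest weights (indices 1 and 3), which is what an exhaustive search over all pairs would also return here. In a real setting the finite-difference gradient would be replaced by automatic differentiation through the model, and the metric and constraint handling would follow the paper's own formulation.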