Causal probing aims to analyze foundation models by examining how intervening on their representation of various latent properties impacts their outputs. Recent works have cast doubt on the theoretical basis of several leading causal probing methods, but it has been unclear how to systematically evaluate the effectiveness of these methods in practice. To address this, we define two key causal probing desiderata: completeness (how thoroughly the representation of the target property has been transformed) and selectivity (how little non-targeted properties have been impacted). We find that there is an inherent tradeoff between the two, and define reliability as their harmonic mean to capture it in a single measure. We introduce an empirical analysis framework to measure and evaluate these quantities, allowing us to make the first direct comparisons between different families of leading causal probing methods (e.g., linear vs. nonlinear, or concept removal vs. counterfactual interventions). We find that: (1) no method is reliable across all layers; (2) more reliable methods have a greater impact on LLM behavior; (3) nonlinear interventions are more reliable in early and intermediate layers, while linear interventions are more reliable in later layers; and (4) concept removal methods are far less reliable than counterfactual interventions, suggesting that they may not be an effective approach to causal probing.
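The reliability metric defined above can be sketched directly from its definition as the harmonic mean of completeness and selectivity. The function name and signature below are illustrative, not taken from the paper; both inputs are assumed to be scores in [0, 1].

```python
def reliability(completeness: float, selectivity: float) -> float:
    """Harmonic mean of completeness and selectivity (both assumed in [0, 1]).

    The harmonic mean is dominated by the smaller of the two scores,
    so an intervention must be both complete AND selective to be reliable.
    """
    if completeness + selectivity == 0:
        return 0.0  # degenerate case: both scores are zero
    return 2 * completeness * selectivity / (completeness + selectivity)
```

Because the harmonic mean is pulled toward the minimum, a method that fully transforms the target property (completeness = 1.0) but damages non-targeted properties (selectivity = 0.0) still scores 0.0, which is exactly the tradeoff the metric is meant to penalize.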