Building on recent progress in fine-tuning language models around a fixed sparse autoencoder, we disentangle the decoder matrix into nearly orthogonal features. This reduces interference and superposition between the features while keeping performance on the target dataset essentially unchanged. Our orthogonality penalty leads to identifiable features, ensuring the uniqueness of the decomposition. Further, we find that the distance between embedded feature explanations increases with a stricter orthogonality penalty, a desirable property for interpretability. Invoking the $\textit{Independent Causal Mechanisms}$ principle, we argue that orthogonality promotes modular representations amenable to causal intervention. We empirically show that these increasingly orthogonalized features allow for isolated interventions. Our code is available at $\texttt{https://github.com/mrtzmllr/sae-icm}$.
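As a minimal sketch of the kind of orthogonality penalty described above, the snippet below measures off-diagonal overlap between unit-normalized decoder columns; the function name, shapes, and Frobenius-norm formulation are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def orthogonality_penalty(decoder: np.ndarray) -> float:
    """Illustrative penalty: squared off-diagonal cosine similarity
    between decoder columns (hypothetical, not the paper's code)."""
    # Normalize each feature direction (column) to unit length.
    cols = decoder / np.linalg.norm(decoder, axis=0, keepdims=True)
    # Gram matrix of cosine similarities between all feature pairs.
    gram = cols.T @ cols
    # Zero the diagonal so only cross-feature interference is penalized.
    off_diag = gram - np.eye(gram.shape[1])
    return float(np.sum(off_diag ** 2))

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 16))  # d_model x n_features decoder (toy sizes)
print(orthogonality_penalty(W))  # strictly positive for random columns
```

The penalty is zero exactly when the columns are mutually orthogonal, so adding it to a fine-tuning loss pushes decoder features toward the near-orthogonal regime the abstract describes.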