Recent advances in interpretability research have made transformer language models more transparent, improving our understanding of the inner workings of both toy and naturally occurring models. However, how these models internally process sentiment changes has yet to be sufficiently answered. In this work, we introduce a new interpretability tool called PCP ablation, in which we replace modules with low-rank matrices based on the principal components of their activations, reducing the model's parameters and behavior to their essentials. We demonstrate PCP ablations on MLP and attention layers in backdoored toy, backdoored large, and naturally occurring models. We identify MLPs as the most important modules for the backdoor mechanism and use this knowledge to remove, insert, and modify backdoor mechanisms with engineered replacements via PCP ablation.
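The core idea of PCP ablation, replacing a module with a low-rank matrix built from the principal components of its activations, can be illustrated with a minimal sketch. The abstract does not specify the exact construction, so the following are assumptions made for illustration only: the function name `pcp_replacement`, the optional per-component `scales`, the use of centered activations, and the choice to apply the resulting matrix directly to the hidden state in place of the module. The activations here are synthetic toy data, not taken from any model.

```python
import numpy as np


def pcp_replacement(activations, rank, scales=None):
    """Build a low-rank matrix from the top principal components.

    activations: array of shape (n_samples, d_model), assumed to be
        module activations collected over some set of inputs.
    rank: number of principal components to keep.
    scales: optional per-component scaling factors (hypothetical knob);
        defaults to 1 for every component.
    """
    # Center the activations before extracting principal directions.
    centered = activations - activations.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:rank]  # shape (rank, d_model)
    if scales is None:
        scales = np.ones(rank)
    scales = np.asarray(scales, dtype=float)
    # Sum over i of scales[i] * outer(p_i, p_i), a matrix of rank <= `rank`.
    return components.T @ (scales[:, None] * components)


# Synthetic activations that lie mostly in a 2-D subspace of a 16-D space.
rng = np.random.default_rng(0)
basis = rng.normal(size=(2, 16))
acts = rng.normal(size=(500, 2)) @ basis + 0.01 * rng.normal(size=(500, 16))

A = pcp_replacement(acts, rank=2)

# Assumed usage: the module's output is replaced by a linear map of the hidden state.
hidden = rng.normal(size=(4, 16))
replaced_out = hidden @ A
```

With unit scales the matrix is an orthogonal projector onto the span of the retained components. Changing the `scales` values is one way such a replacement could be engineered to strengthen or weaken the behavior carried by each component.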