Backdoor attacks pose significant security risks for Large Language Models (LLMs), yet the internal mechanisms by which triggers operate remain poorly understood. We present the first mechanistic analysis of language-switching backdoors, studying the GAPperon model family (1B, 8B, and 24B parameters), which contains triggers injected during pretraining that cause the model to switch its output language. Using activation patching, we localize trigger formation to early layers (7.5% to 25% of model depth) and identify the attention heads that process trigger information. Our central finding is that, across model scales, trigger-activated heads substantially overlap with heads that naturally encode output language, with Jaccard indices between 0.18 and 0.66 over the top heads identified. This suggests that backdoor triggers do not form isolated circuits but instead co-opt the model's existing language components. These findings have implications for backdoor defense: detection methods may benefit from monitoring known functional components rather than searching for hidden circuits, and mitigation strategies could potentially leverage this entanglement between injected and natural behaviors.
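To make the overlap measure concrete, the sketch below computes a Jaccard index between two sets of top-ranked attention heads. It is only an illustration under stated assumptions: the per-head scores are made-up toy values, the `(layer, head)` keying and the helper names `top_k_heads` and `jaccard` are hypothetical, and the choice of `k = 3` is arbitrary; none of these come from the paper.

```python
from typing import Dict, Set, Tuple

Head = Tuple[int, int]  # assumed convention: (layer index, head index)


def top_k_heads(scores: Dict[Head, float], k: int) -> Set[Head]:
    """Return the k heads with the highest scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def jaccard(a: Set[Head], b: Set[Head]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; defined here as 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Toy scores, invented for illustration only (not results from the paper).
trigger_scores = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.8, (1, 1): 0.7, (2, 0): 0.2}
language_scores = {(0, 0): 0.6, (0, 1): 0.5, (1, 0): 0.9, (1, 1): 0.1, (2, 0): 0.05}

trigger_top = top_k_heads(trigger_scores, k=3)    # {(0, 0), (1, 0), (1, 1)}
language_top = top_k_heads(language_scores, k=3)  # {(1, 0), (0, 0), (0, 1)}

# Two shared heads out of four distinct heads gives 2 / 4 = 0.5.
overlap = jaccard(trigger_top, language_top)
print(f"Jaccard index over top-3 heads: {overlap:.2f}")
```

When both sets have the same size k and share m heads, the index reduces to m / (2k - m), so values well above zero require a sizeable fraction of shared heads.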