This paper investigates how backdoor attacks are represented within Vision Transformers (ViTs). By assuming knowledge of the trigger, we identify a specific ``trigger direction'' in the model's activations that corresponds to the internal representation of the trigger. We confirm the causal role of this linear direction by showing that interventions in both activation and parameter space consistently modulate the model's backdoor behavior across multiple datasets and attack types. Using this direction as a diagnostic tool, we trace how backdoor features are processed across layers. Our analysis reveals distinct qualitative differences: static-patch triggers follow a different internal logic than stealthy, distributed triggers. We further examine the link between backdoors and adversarial attacks, specifically testing whether PGD-based perturbations (de-)activate the identified trigger mechanism. Finally, we propose a data-free, weight-based detection scheme for stealthy-trigger attacks. Our findings show that mechanistic interpretability offers a robust framework for diagnosing and addressing security vulnerabilities in computer vision.