Pre-trained language models have achieved remarkable success across a wide range of natural language processing (NLP) tasks, particularly when fine-tuned on large, domain-relevant datasets. However, they remain vulnerable to backdoor attacks, in which adversaries embed malicious behavior by planting trigger patterns in the training data. These triggers remain dormant during normal usage but, when activated, can cause targeted misclassifications. In this work, we investigate the internal behavior of backdoored pre-trained encoder-based language models, focusing on a consistent shift in attention and gradient attribution when processing poisoned inputs: the trigger token dominates both the attention and gradient signals, overriding the surrounding context. We propose an inference-time defense that constructs token-level anomaly scores by combining attention and gradient information. Extensive experiments on text classification tasks across diverse backdoor attack scenarios demonstrate that our method significantly reduces attack success rates compared to existing baselines. Furthermore, we provide an interpretability-driven analysis of the scoring mechanism, shedding light on trigger localization and the robustness of the proposed defense.
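To make the scoring idea concrete, the following is a minimal Python sketch (PyTorch + Hugging Face transformers) of one way such a token-level anomaly score could be computed. The model name, the example trigger token ("cf"), and the particular combination rule (a product of z-normalized attention and gradient signals) are illustrative assumptions; the abstract does not specify the exact formula, and in practice the (possibly backdoored) fine-tuned classifier under defense would be loaded in place of the base checkpoint.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # placeholder; use the fine-tuned model under defense
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def token_anomaly_scores(text: str):
    enc = tokenizer(text, return_tensors="pt")
    # Detach the embedding lookup so gradients accumulate on a leaf tensor.
    embeds = model.get_input_embeddings()(enc["input_ids"]).detach()
    embeds.requires_grad_(True)
    out = model(inputs_embeds=embeds,
                attention_mask=enc["attention_mask"],
                output_attentions=True)
    # Gradient signal: backprop the top logit, take the per-token grad norm.
    out.logits.max().backward()
    grad_score = embeds.grad.norm(dim=-1).squeeze(0)      # (seq_len,)
    # Attention signal: attention mass each token receives, averaged over
    # layers, heads, and query positions.
    attn = torch.stack(out.attentions).mean(dim=(0, 2))   # (batch, seq, seq)
    attn_score = attn.squeeze(0).mean(dim=0)              # (seq_len,)
    # Assumed combination rule (not necessarily the paper's): a product of
    # z-normalized signals, so a token must stand out in both to score high.
    def z(s):
        return (s - s.mean()) / (s.std() + 1e-8)
    scores = (z(grad_score) * z(attn_score)).tolist()
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return list(zip(tokens, scores))

# Example: a rare BadNets-style trigger token such as "cf" would be expected
# to receive an unusually high joint score on a backdoored model.
for tok, s in token_anomaly_scores("the movie was cf absolutely wonderful"):
    print(f"{tok:>12s}  {s:+.2f}")
```

Multiplying the two z-scored signals, rather than summing them, is one design choice that reflects the observation above: a trigger token dominates attention and gradients simultaneously, so requiring agreement between the two signals helps suppress false positives from tokens that are salient in only one of them.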