Large Language Models (LLMs) are known to be vulnerable to backdoor attacks, where triggers embedded in poisoned samples can maliciously alter LLMs' behaviors. In this paper, we move beyond attacking LLMs and instead examine backdoor attacks through the novel lens of natural language explanations. Specifically, we leverage LLMs' generative capabilities to produce human-readable explanations for their decisions, enabling direct comparisons between explanations for clean and poisoned samples. Our results show that backdoored models produce coherent explanations for clean inputs but diverse and logically flawed explanations for poisoned data, a pattern that holds across classification and generation tasks and across different backdoor attacks. Further analysis reveals key insights into the explanation generation process. At the token level, explanation tokens associated with poisoned samples appear only in the final few transformer layers. At the sentence level, attention dynamics indicate that poisoned inputs shift attention away from the original input context during explanation generation. These findings enhance our understanding of backdoor mechanisms in LLMs and present a promising framework for detecting vulnerabilities through explainability.