Large language models (LLMs) have revolutionized software development practices, yet concerns about their safety have arisen, particularly regarding hidden backdoors, also known as trojans. Backdoor attacks involve the insertion of triggers into training data, allowing attackers to maliciously manipulate the model's behavior. In this paper, we focus on analyzing model parameters to detect potential backdoor signals in code models. Specifically, we examine the attention weights and biases, activation values, and context embeddings of clean and poisoned CodeBERT models. Our results reveal noticeable patterns in the activation values and context embeddings of poisoned samples for the poisoned CodeBERT model; however, the attention weights and biases show no significant differences. This work contributes to ongoing efforts in the white-box detection of backdoor signals in LLMs of code through the analysis of parameters and activations.
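One simple way to probe for the kind of embedding-level signal described above is to compare summary statistics of context embeddings from clean and poisoned inputs. The sketch below is a minimal illustration of this idea, not the paper's actual method: it uses synthetic 768-dimensional vectors (CodeBERT's hidden size) in place of real model embeddings, and the `embedding_shift` helper is a hypothetical name introduced here for illustration.

```python
import numpy as np

def embedding_shift(clean_emb, poisoned_emb):
    """Cosine distance between the centroids of two embedding sets.

    A large centroid shift is one heuristic signal that poisoning has
    altered the model's context embeddings (illustrative only; this is
    not the detection procedure from the paper).
    """
    c = clean_emb.mean(axis=0)
    p = poisoned_emb.mean(axis=0)
    cosine = np.dot(c, p) / (np.linalg.norm(c) * np.linalg.norm(p))
    return 1.0 - cosine

# Toy data standing in for real CodeBERT context embeddings:
# 100 samples each, 768 dimensions (CodeBERT's hidden size).
rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(100, 768))
poisoned = clean + rng.normal(0.5, 0.1, size=(100, 768))  # simulated poisoning shift

print(f"centroid cosine distance: {embedding_shift(clean, poisoned):.3f}")
```

In practice, the vectors would come from running clean and trigger-containing code samples through the model and collecting hidden states; the same comparison could then be applied per layer to localize where a backdoor signal emerges.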