Large language models (LLMs) have revolutionized software development practices, yet they have also raised safety concerns, particularly regarding hidden backdoors, also known as trojans. In a backdoor attack, an attacker inserts triggers into the training data so that the model's behavior can later be manipulated maliciously. In this paper, we analyze model parameters to detect potential backdoor signals in code models. Specifically, we examine the attention weights and biases, activation values, and context embeddings of clean and poisoned CodeBERT models. Our results suggest noticeable patterns in the activation values and context embeddings of poisoned samples in the poisoned CodeBERT model; attention weights and biases, however, show no significant differences. This work contributes to ongoing efforts in white-box detection of backdoor signals in LLMs of code through the analysis of parameters and activations.