Trojan signatures, as described by Fields et al. (2021), are noticeable differences between the distribution of the trojaned class parameters (weights) and that of the non-trojaned class parameters of a trojaned model, which can be used to detect the trojaned model. Fields et al. (2021) found trojan signatures in computer vision classification tasks with image models such as ResNet, WideResNet, DenseNet, and VGG. In this paper, we investigate such signatures in the classifier layer parameters of large language models (LLMs) of source code. Our results suggest that trojan signatures may not generalize to LLMs of code. We found that trojaned code models are stubborn in this respect, even when the models were poisoned under more explicit settings (fine-tuned with the pre-trained weights frozen). We analyzed nine trojaned models for two binary classification tasks: clone detection and defect detection. To the best of our knowledge, this is the first work to examine weight-based trojan signature revelation techniques for large language models of code and, furthermore, to demonstrate that detecting trojans from the weights alone in such models is a hard problem.
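To make the notion of a weight-based signature concrete, the following is a minimal sketch, assuming the final classifier layer is available as a matrix with one row of weights per class. It is not the exact procedure of Fields et al. (2021) or of this paper. The function name `mean_shift_scores`, the shift statistic, and the synthetic weights are all hypothetical and chosen only for illustration.

```python
import numpy as np


def mean_shift_scores(classifier_weights):
    """Score each class by how far its mean weight sits from the other classes.

    classifier_weights: array of shape (num_classes, hidden_dim), one row per
    class (an assumed layout for a final linear classification layer).
    Returns one score per class: the difference between that class's mean
    weight and the mean of all other classes' weights, divided by the
    standard deviation of the other classes' weights.
    """
    w = np.asarray(classifier_weights, dtype=float)
    scores = []
    for c in range(w.shape[0]):
        # Pool the weights of every class except c.
        others = np.delete(w, c, axis=0).ravel()
        scores.append((w[c].mean() - others.mean()) / (others.std() + 1e-12))
    return np.array(scores)


# Synthetic illustration only: two classes, 768 hidden units (assumed sizes).
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=(2, 768))
weights[1] += 0.01  # artificially shift class 1 to imitate a signature

scores = mean_shift_scores(weights)
print(scores)  # the shifted class receives the positive score
```

In this synthetic case the shifted row stands out by construction. The results summarized above suggest that real trojaned code models may not show such a clear separation in their classifier weights.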