Training machine learning models can be very expensive or even unaffordable, for example because the required data is unavailable or too large to handle, or because computational resources are limited. It is therefore common practice to rely on open-source pre-trained models whenever possible. However, this practice is alarming from a security perspective: pre-trained models can be infected with Trojan attacks, in which the attacker embeds a trigger in the model so that the attacker can control the model's behavior whenever the trigger is present in the input. In this paper, we present a novel method for detecting Trojan models. Our method creates a signature for a model based on activation optimization, and a classifier is then trained to detect a Trojan model given its signature. We call our method TRIGS, for TRojan Identification from Gradient-based Signatures. TRIGS achieves state-of-the-art performance on two public datasets of convolutional models. Additionally, we introduce a new, challenging dataset of ImageNet models based on the vision transformer architecture. TRIGS delivers the best performance on the new dataset, surpassing the baseline methods by a large margin. Our experiments also show that TRIGS requires only a small number of clean samples to achieve good performance, and works reasonably well even if the defender has no prior knowledge of the attacker's model architecture. Our dataset will be released soon.
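The abstract does not spell out how a signature is computed, so the following is only a rough illustration of the general idea of activation optimization, not the TRIGS implementation. It assumes a hypothetical toy model (a linear softmax classifier with hand-picked weights `W` and `b`) and, for each output class, runs gradient ascent on the input to raise that class's log-probability. The optimized inputs are then concatenated into one vector, which stands in for the "signature" that a downstream detector could be trained on. All names here (`maximize_class`, `signature`, the step count, the learning rate) are assumptions made for this sketch.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def logits(W, b, x):
    """Logits of a toy linear classifier: one row of W per class."""
    return [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]


def maximize_class(W, b, target, steps=200, lr=0.1):
    """Gradient ascent on the input x to raise log p(target | x)."""
    x = [0.0] * len(W[0])
    for _ in range(steps):
        p = softmax(logits(W, b, x))
        # For a linear softmax model:
        # d/dx log p_target = W[target] - sum_k p_k * W[k]
        grad = [
            W[target][j] - sum(p[k] * W[k][j] for k in range(len(W)))
            for j in range(len(x))
        ]
        x = [xj + lr * gj for xj, gj in zip(x, grad)]
    return x


def signature(W, b):
    """Concatenate the optimized input of every class into one vector."""
    sig = []
    for c in range(len(W)):
        sig.extend(maximize_class(W, b, c))
    return sig


# Hypothetical 3-class model over 2-dimensional inputs (illustration only).
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]
sig = signature(W, b)
```

In a real setting the toy classifier would be replaced by the network under inspection, with gradients obtained through automatic differentiation, and a separate classifier would be trained on signatures collected from many clean and Trojan models.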