Because training data is often unavailable or very large, and training machine learning models carries high computational and human labor costs, it is common practice to rely on open-source pre-trained models whenever possible. However, this practice is worrisome from a security perspective. Pre-trained models can be infected with Trojan attacks, in which the attacker embeds a trigger in the model so that the attacker can control the model's behavior whenever the trigger is present in the input. In this paper, we present our preliminary work on a novel method for Trojan model detection. Our method creates a signature for a model based on activation optimization. A classifier is then trained to detect a Trojan model given its signature. Our method achieves state-of-the-art performance on two public datasets.
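To make the signature idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy softmax-linear model with weights `W` and bias `b`, and it assumes that "activation optimization" can be illustrated by projected gradient ascent on an input that maximizes each class's log-probability. The function `activation_signature`, the pool `models`, and all hyperparameters are hypothetical names and values chosen for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def activation_signature(W, b, steps=100, lr=0.05):
    """Hypothetical signature: for each class, optimize an input that
    maximizes that class's log-probability, then concatenate the results."""
    num_classes, dim = W.shape
    parts = []
    for c in range(num_classes):
        x = np.full(dim, 0.5)  # start from a neutral mid-range input
        for _ in range(steps):
            p = softmax(W @ x + b)
            grad = W[c] - p @ W  # gradient of log p_c with respect to x
            x = np.clip(x + lr * grad, 0.0, 1.0)  # project back into [0, 1]
        parts.append(x)
    return np.concatenate(parts)

rng = np.random.default_rng(0)
# Hypothetical pool of toy models: 3 classes, 8 input features each.
models = [(rng.normal(size=(3, 8)), np.zeros(3)) for _ in range(4)]
# One fixed-length signature per model, stacked into a feature matrix.
signatures = np.stack([activation_signature(W, b) for W, b in models])
print(signatures.shape)  # (4, 24)
```

Under these assumptions, each row of `signatures` would serve as the feature vector for one model, and a standard binary classifier could be trained on such rows paired with clean/Trojan labels. The abstract does not specify the architecture, objective, or classifier actually used.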