Backdoor attacks pose a serious security threat to the training of neural networks, as they surreptitiously introduce hidden functionality into a model. Such backdoors remain silent during inference on clean inputs and evade detection because the model behaves inconspicuously. However, once a specific trigger pattern appears in the input data, the backdoor activates and causes the model to execute its concealed function. Detecting such poisoned samples in large datasets through manual inspection is virtually impossible. To address this challenge, we propose a novel approach that enables model training on potentially poisoned datasets by using recent diffusion models. Specifically, we create synthetic variations of all training samples, leveraging the inherent resilience of diffusion models to potential trigger patterns in the data. By combining this generative approach with knowledge distillation, we produce student models that maintain their general performance on the task while exhibiting robust resistance to backdoor triggers.
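The knowledge distillation step can be illustrated with a minimal sketch. It assumes the common soft-label formulation, in which the student is trained to match the teacher's temperature-softened output distribution via a KL divergence; the abstract does not state which distillation loss is used, so the loss form, the function names, and the `temperature` parameter are all illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by `temperature`."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumption: a standard soft-label distillation loss, not necessarily
    the one used in the paper.
    """
    p = softmax(teacher_logits, temperature)  # teacher soft labels
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2


def batch_distillation_loss(teacher_batch, student_batch, temperature=2.0):
    """Mean distillation loss over a batch of per-sample logit lists."""
    losses = [
        distillation_loss(t, s, temperature)
        for t, s in zip(teacher_batch, student_batch)
    ]
    return sum(losses) / len(losses)
```

In a pipeline of the kind described above, both sets of logits would presumably be computed on the diffusion-generated variation of each training sample rather than on the original, possibly poisoned, sample, so that the student is not trained directly on images carrying a trigger.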