Backdoor attacks pose a serious security threat to neural network training, as they surreptitiously introduce hidden functionality into a model. Such backdoors remain silent during inference on clean inputs and evade detection because the model behaves inconspicuously. However, once a specific trigger pattern appears in the input data, the backdoor activates and causes the model to execute its concealed function. Detecting the poisoned samples responsible within vast datasets is virtually impossible through manual inspection. To address this challenge, we propose a novel approach that enables model training on potentially poisoned datasets by using recent diffusion models. Specifically, we create synthetic variations of all training samples, leveraging the inherent resilience of diffusion models to potential trigger patterns in the data. By combining this generative approach with knowledge distillation, we produce student models that maintain their general performance on the task while exhibiting robust resistance to backdoor triggers.
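To make the distillation step more concrete, the following is a minimal sketch of one common formulation: a temperature-scaled KL divergence between teacher and student predictions, evaluated on synthetic variations of the training samples. The abstract does not specify the loss or the pipeline details, so everything here is an assumption for illustration. In particular, `synthesize` (a stand-in for a diffusion-based variation step), `teacher`, `build_distillation_set`, and the temperature value are hypothetical names and choices, not the method's actual components.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_sum_exp for s in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Assumption: a standard soft-label distillation loss; the source
    does not state which distillation objective is used.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return (temperature ** 2) * kl


def build_distillation_set(samples, synthesize, teacher):
    """Pair each synthetic variation with the teacher's logits for it.

    `synthesize` and `teacher` are hypothetical callables: the first
    stands in for a diffusion-based image variation step, the second
    for a model trained on the (potentially poisoned) dataset.
    """
    pairs = []
    for x in samples:
        x_syn = synthesize(x)  # variation intended to drop any trigger
        pairs.append((x_syn, teacher(x_syn)))
    return pairs
```

In this sketch the student would be trained to minimize `distillation_loss` over the pairs returned by `build_distillation_set`, so it only ever sees synthetic variations and the teacher's outputs on them, never the original samples.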