Train-time data poisoning attacks threaten machine learning models by injecting adversarially crafted samples into the training set, causing misclassification at test time. Current defense methods often reduce generalization performance, are attack-specific, and impose significant training overhead. To address this, we introduce a set of universal data purification methods using a stochastic transform, $\Psi(x)$, realized via iterative Langevin dynamics of Energy-Based Models (EBMs), Denoising Diffusion Probabilistic Models (DDPMs), or both. These approaches purify poisoned data with minimal impact on classifier generalization. Our specially trained EBMs and DDPMs provide state-of-the-art defense against various attacks (including Narcissus, Bullseye Polytope, and Gradient Matching) on CIFAR-10, Tiny-ImageNet, and CINIC-10, without needing attack- or classifier-specific information. We discuss performance trade-offs and show that our methods remain highly effective even when the generative models are trained on poisoned or distributionally shifted data.
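The purification transform described above can be sketched with the standard Langevin update $x \leftarrow x - \tfrac{\epsilon}{2}\nabla_x E(x) + \sqrt{\epsilon}\,\eta$, $\eta \sim \mathcal{N}(0, I)$. The snippet below is a minimal illustration, not the paper's implementation: a toy quadratic energy stands in for a trained EBM (or a DDPM score), and the function name and parameters are assumptions for demonstration.

```python
import numpy as np

def langevin_purify(x, grad_energy, n_steps=500, step_size=1e-2, rng=None):
    """Iterative Langevin dynamics as a stochastic purification transform.

    `grad_energy` stands in for the gradient of a trained EBM's energy
    (or, equivalently, a negated DDPM score); here it is a toy placeholder.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.asarray(x, dtype=float).copy()
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        # Gradient step toward low energy, plus injected Gaussian noise.
        x = x - 0.5 * step_size * grad_energy(x) + np.sqrt(step_size) * noise
    return x

# Toy energy E(x) = 0.5 * ||x||^2, so grad E(x) = x; the dynamics pull a
# "poisoned" point back toward the high-density region near the origin.
poisoned = np.full(8, 5.0)
purified = langevin_purify(poisoned, grad_energy=lambda x: x)
```

Intuitively, a poison perturbation places the sample in a low-density region of the learned distribution; the noisy gradient steps move it back toward high-density regions while the injected noise prevents collapse to a single mode, leaving clean image content largely intact.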