Detecting adversarial samples, which are carefully crafted to fool a model, is a critical step toward socially secure applications. However, existing adversarial detection methods require access to sufficient training data, which raises notable concerns about privacy leakage and generalizability. In this work, we validate that adversarial samples generated by attack algorithms are strongly related to a specific vector in the high-dimensional input space. Such vectors, namely UAPs (Universal Adversarial Perturbations), can be calculated without the original training data. Based on this discovery, we propose a data-agnostic adversarial detection framework that induces different responses to UAPs between normal and adversarial samples. Experimental results show that our method achieves competitive detection performance on various text classification tasks while keeping time consumption equivalent to normal inference.
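To make the idea of "different responses to UAPs" concrete, the following is a minimal sketch and not the framework proposed in this work. It assumes a toy linear softmax classifier, a hand-picked perturbation vector standing in for a UAP, and a detection score defined as the drop in confidence of the originally predicted class after the perturbation is added. The names `predict_proba`, `uap_response`, `is_adversarial`, the threshold value, and the assumption that adversarial inputs react more strongly than normal ones are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def predict_proba(W, b, x):
    # Toy stand-in for a text classifier acting on an input embedding x.
    return softmax(x @ W + b)


def uap_response(W, b, x, uap):
    # Hypothetical score: how much the confidence of the originally
    # predicted class drops once the perturbation is added to the input.
    p_before = predict_proba(W, b, x)
    p_after = predict_proba(W, b, x + uap)
    label = int(np.argmax(p_before))
    return float(p_before[label] - p_after[label])


def is_adversarial(W, b, x, uap, threshold):
    # Assumption: inputs whose prediction reacts strongly to the
    # perturbation are flagged; the direction and threshold are illustrative.
    return uap_response(W, b, x, uap) > threshold


# Two features, two classes: class 0 is predicted when x[0] > 0.
W = np.array([[1.0, -1.0],
              [0.0, 0.0]])
b = np.zeros(2)

# Hand-picked perturbation pushing every input toward class 1
# (a placeholder for a UAP, not one computed by any published method).
uap = np.array([-1.0, 0.0])

confident_input = np.array([3.0, 0.0])      # far from the decision boundary
near_boundary_input = np.array([0.2, 0.0])  # close to the decision boundary
threshold = 0.2

print(is_adversarial(W, b, confident_input, uap, threshold))
print(is_adversarial(W, b, near_boundary_input, uap, threshold))
```

In this toy setting the input far from the decision boundary barely reacts to the perturbation, while the input near the boundary loses most of its confidence and is flagged. Note that the score needs only the model and the perturbation vector, with no training data, and costs one additional forward pass per input.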