Adversarial examples have raised several open questions, such as why they can deceive classifiers and transfer between different models. A prevailing hypothesis suggests that adversarial perturbations appear to be random noise but contain class-specific features. This hypothesis is supported by the success of perturbation learning, where classifiers trained solely on adversarial examples and the corresponding incorrect labels generalize well to correctly labeled test data. Although this hypothesis and perturbation learning effectively explain intriguing properties of adversarial examples, they lack a solid theoretical foundation. In this study, we theoretically explain the counterintuitive success of perturbation learning. We consider wide two-layer networks, and our results hold for any data distribution. We prove that adversarial perturbations contain sufficient class-specific features for networks to generalize from them. Moreover, the predictions of classifiers trained on mislabeled adversarial examples coincide with those of classifiers trained on correctly labeled clean samples. The code is available at https://github.com/s-kumano/perturbation-learning.
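The perturbation-learning setup described above can be illustrated with a minimal sketch. This is not the paper's construction: a closed-form linear (mean) classifier on a synthetic Gaussian mixture stands in for the wide two-layer networks, and FGSM-style sign perturbations stand in for general adversarial attacks; all dimensions and parameters (`d`, `mu0`, `eps`, `n`) are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, mu0, eps = 50, 2000, 0.2, 0.5   # illustrative toy-problem parameters

# Synthetic binary data: labels y in {-1, +1}, features x = y*mu + Gaussian noise
mu = np.full(d, mu0)
y = rng.choice([-1.0, 1.0], size=n)
X = y[:, None] * mu + rng.normal(size=(n, d))

# "Clean" classifier: mean estimator w ~ E[y x], trained on correctly labeled data
w_clean = (y[:, None] * X).mean(axis=0)

# FGSM-style adversarial examples pushed toward the opposite class,
# paired with the (incorrect) target label -y
X_adv = X - eps * y[:, None] * np.sign(w_clean)
y_adv = -y

# Classifier trained ONLY on mislabeled adversarial examples
w_pert = (y_adv[:, None] * X_adv).mean(axis=0)

# Evaluate both classifiers on fresh, correctly labeled clean test data
y_test = rng.choice([-1.0, 1.0], size=n)
X_test = y_test[:, None] * mu + rng.normal(size=(n, d))
acc_clean = ((X_test @ w_clean) * y_test > 0).mean()
acc_pert = ((X_test @ w_pert) * y_test > 0).mean()
agree = (np.sign(X_test @ w_clean) == np.sign(X_test @ w_pert)).mean()
print(f"clean-trained acc: {acc_clean:.2f}, perturbation-trained acc: {acc_pert:.2f}, agreement: {agree:.2f}")
```

In this linear toy model the effect is transparent: the perturbation `-eps*y*sign(w_clean)` is anti-correlated with the true label, so once it is paired with the flipped label `-y` it becomes a usable class-specific feature, and the perturbation-trained classifier ends up aligned with the clean one, echoing the prediction-agreement result stated in the abstract.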