The literature on adversarial attacks in computer vision typically focuses on pixel-level perturbations, which tend to be very difficult to interpret. Recent work that manipulates the latent representations of image generators to create "feature-level" adversarial perturbations gives us an opportunity to explore perceptible, interpretable adversarial attacks. We make three contributions. First, we observe that feature-level attacks provide useful classes of inputs for studying representations in models. Second, we show that these adversaries are uniquely versatile and highly robust: they can be used to produce targeted, universal, disguised, physically-realizable, and black-box attacks at the ImageNet scale. Third, we show how these adversarial images can be used as a practical interpretability tool for identifying bugs in networks. We use these adversaries to make predictions about spurious associations between features and classes, which we then test by designing "copy/paste" attacks in which one natural image is pasted into another to cause a targeted misclassification. Our results suggest that feature-level attacks are a promising approach for rigorous interpretability research. They support the design of tools to better understand what a model has learned and diagnose brittle feature associations. Code is available at https://github.com/thestephencasper/feature_level_adv
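As a rough illustration of the "copy/paste" idea described above, the following minimal sketch pastes one image into another. It is not the authors' implementation (the linked repository is the place to look for that). Images are represented here as plain nested lists of pixel intensities, and the names `paste_patch`, `background`, and `patch` are hypothetical. Checking whether the composite actually causes a targeted misclassification would require a trained classifier, which is omitted.

```python
from typing import List

# Assumption: a grayscale image is a rectangular list of rows of intensities.
Image = List[List[float]]


def paste_patch(background: Image, patch: Image, top: int, left: int) -> Image:
    """Return a copy of `background` with `patch` pasted at row `top`, column `left`."""
    height, width = len(background), len(background[0])
    p_height, p_width = len(patch), len(patch[0])
    if top < 0 or left < 0 or top + p_height > height or left + p_width > width:
        raise ValueError("patch does not fit inside the background")
    # Copy every row so the caller's background image is left untouched.
    result = [row[:] for row in background]
    for r in range(p_height):
        # Overwrite the covered pixels of this row with the patch row.
        result[top + r][left:left + p_width] = list(patch[r])
    return result


# Toy usage: paste a 2x2 "natural image" patch into a 4x4 background.
background = [[0.0] * 4 for _ in range(4)]
patch = [[1.0, 1.0], [1.0, 1.0]]
pasted = paste_patch(background, patch, top=1, left=1)
```

In an actual copy/paste experiment, `pasted` would then be passed to the network under study to see whether its prediction shifts to the target class.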