Robust Feature-Level Adversaries are Interpretability Tools

The literature on adversarial attacks in computer vision typically focuses on pixel-level perturbations. These tend to be very difficult to interpret. Recent work that manipulates the latent representations of image generators to create "feature-level" adversarial perturbations gives us an opportunity to explore perceptible, interpretable adversarial attacks. We make three contributions. First, we observe that feature-level attacks provide useful classes of inputs for studying representations in models. Second, we show that these adversaries are uniquely versatile and highly robust. We demonstrate that they can be used to produce targeted, universal, disguised, physically realizable, and black-box attacks at the ImageNet scale. Third, we show how these adversarial images can be used as a practical interpretability tool for identifying bugs in networks. We use these adversaries to make predictions about spurious associations between features and classes, which we then test by designing "copy/paste" attacks in which one natural image is pasted into another to cause a targeted misclassification. Our results suggest that feature-level attacks are a promising approach for rigorous interpretability research. They support the design of tools to better understand what a model has learned and to diagnose brittle feature associations. Code is available at https://github.com/thestephencasper/feature_level_adv
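
To make the contrast with pixel-level perturbations concrete, here is a minimal toy sketch of the general idea behind a targeted feature-level attack. Instead of perturbing pixels, it runs gradient ascent on the latent input of a generator so that the generated image is classified as a chosen target class. The linear "generator" `W_gen`, the linear "classifier" `A_clf`, the function names `feature_level_attack` and `predict`, and the L-infinity bound `eps` on the latent perturbation are all hypothetical stand-ins. They were chosen so the example runs with only the Python standard library. This is not the authors' implementation, which works with real image generators and classifiers at ImageNet scale (see the linked repository).

```python
import math


def matvec(m, v):
    # Multiply matrix m (a list of rows) by vector v.
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    # Multiply matrix a (n x k) by matrix b (k x m), both given as lists of rows.
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(logits):
    # Numerically stable softmax over a list of logits.
    mx = max(logits)
    exps = [math.exp(l - mx) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(W_gen, A_clf, z):
    # Toy pipeline: latent z -> "image" W_gen @ z -> logits A_clf @ image -> argmax class.
    logits = matvec(A_clf, matvec(W_gen, z))
    return max(range(len(logits)), key=lambda k: logits[k])


def feature_level_attack(W_gen, A_clf, z0, target, steps=200, lr=0.1, eps=1.0):
    """Projected gradient ascent on log p(target | generator(z)) over the latent z.

    Hypothetical toy setup: W_gen (pixels x latent) and A_clf (classes x pixels)
    are linear, and the latent perturbation z - z0 is kept in an L-inf box of radius eps.
    """
    M = matmul(A_clf, W_gen)  # composed linear map: latent -> logits
    z = list(z0)
    for _ in range(steps):
        probs = softmax(matvec(M, z))
        # Gradient of log softmax(M z)[target] w.r.t. z is M^T (onehot(target) - probs).
        residual = [(1.0 if k == target else 0.0) - p for k, p in enumerate(probs)]
        grad = [sum(M[k][j] * residual[k] for k in range(len(M))) for j in range(len(z))]
        z = [zj + lr * gj for zj, gj in zip(z, grad)]
        # Project back into the eps-box around the original latent.
        z = [min(max(zj, z0j - eps), z0j + eps) for zj, z0j in zip(z, z0)]
    return z


# Usage with tiny hand-picked weights: 2-d latent, 3-"pixel" image, 2 classes.
W_gen = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
A_clf = [[1.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
z0 = [1.0, -0.5]
z_adv = feature_level_attack(W_gen, A_clf, z0, target=1)
print(predict(W_gen, A_clf, z0), predict(W_gen, A_clf, z_adv))  # 0 1
```

Because both toy models are linear, the map from latent to logits collapses to a single matrix `M`, so the gradient has the closed form M^T(onehot - softmax). With a real generator and classifier, that gradient would come from automatic differentiation instead. The perturbation is still applied in the latent space rather than to individual pixels.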

