Red Teaming Deep Neural Networks with Feature Synthesis Tools

Interpretable AI tools are often motivated by the goal of understanding model behavior in out-of-distribution (OOD) contexts. Despite the attention this area of study receives, there are comparatively few cases where these tools have identified novel, previously unknown bugs in models. We argue that this is due, in part, to a common feature of many interpretability methods: they analyze and explain the behavior of a model using a particular dataset. While this is useful, such tools can only analyze behaviors induced by features that the user can sample or identify in advance. To address this, a growing body of research involves interpreting models using feature synthesis methods, which do not depend on a dataset. In this paper, our primary contribution is a benchmark to evaluate interpretability tools. Our key insight is that we can train models that respond to specific triggers (e.g., a specific patch inserted into an image) with specific outputs (i.e., a label) and then evaluate interpretability tools based on whether they help humans identify these triggers. We make four contributions. (1) We propose trojan discovery as an evaluation task for interpretability tools and introduce a trojan-discovery benchmark with 12 trojans of 3 different types. (2) We demonstrate the difficulty of this benchmark with a preliminary evaluation of 16 feature attribution/saliency tools. Even with access to data containing a trojan's trigger, these methods regularly fail to identify bugs. (3) We evaluate 7 feature-synthesis methods on our benchmark. (4) We introduce and evaluate 2 variants of the best-performing method from the previous evaluation.
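To make the trigger-and-label idea concrete, the following is a minimal sketch of one common way to implant a patch trojan: pasting a fixed patch into a fraction of the training images and relabeling them with a target class. The abstract does not describe the training procedure, so this is an illustrative assumption, not the authors' implementation. The function names `insert_patch` and `poison_dataset`, the `rate` parameter, and the representation of images as 2-D lists of grayscale pixel values are all hypothetical choices made to keep the example dependency-free.

```python
import copy
import random


def insert_patch(image, patch, top, left):
    """Return a copy of `image` with `patch` pasted at row `top`, column `left`.

    Both arguments are 2-D lists (rows of grayscale pixel values).
    """
    h, w = len(patch), len(patch[0])
    if top < 0 or left < 0 or top + h > len(image) or left + w > len(image[0]):
        raise ValueError("patch does not fit inside the image at this position")
    out = copy.deepcopy(image)
    for r in range(h):
        # Overwrite one row of the target region with the matching patch row.
        out[top + r][left:left + w] = list(patch[r])
    return out


def poison_dataset(images, labels, patch, target_label, rate, seed=0):
    """Paste `patch` into a fraction `rate` of the images and relabel them.

    Hypothetical helper: returns new image and label lists plus the sorted
    indices of the poisoned examples. The inputs are left unmodified.
    """
    rng = random.Random(seed)
    n_poison = round(rate * len(images))
    chosen = sorted(rng.sample(range(len(images)), n_poison))
    new_images, new_labels = list(images), list(labels)
    h, w = len(patch), len(patch[0])
    for i in chosen:
        # Pick a random location where the patch fits entirely inside the image.
        top = rng.randint(0, len(images[i]) - h)
        left = rng.randint(0, len(images[i][0]) - w)
        new_images[i] = insert_patch(images[i], patch, top, left)
        new_labels[i] = target_label
    return new_images, new_labels, chosen
```

A model trained on the output of `poison_dataset` would be expected to associate the patch with `target_label`. In the benchmark framing, an interpretability tool is then judged by whether it helps a human recover what the patch looks like.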
