Red Teaming Deep Neural Networks with Feature Synthesis Tools

Interpretable AI tools are often motivated by the goal of understanding model behavior in out-of-distribution (OOD) contexts. Despite the attention this area of study receives, there are comparatively few cases where these tools have identified previously unknown bugs in models. We argue that this is due, in part, to a common feature of many interpretability methods: they analyze model behavior by using a particular dataset. This only allows for the study of the model in the context of features that the user can sample in advance. To address this, a growing body of research involves interpreting models using feature synthesis methods that do not depend on a dataset. In this paper, we benchmark the usefulness of interpretability tools on debugging tasks. Our key insight is that we can implant human-interpretable trojans into models and then evaluate these tools based on whether they can help humans discover them. This is analogous to finding OOD bugs, except the ground truth is known, allowing us to know when an interpretation is correct. We make four contributions. (1) We propose trojan discovery as an evaluation task for interpretability tools and introduce a benchmark with 12 trojans of 3 different types. (2) We demonstrate the difficulty of this benchmark with a preliminary evaluation of 16 state-of-the-art feature attribution/saliency tools. Even under ideal conditions, given direct access to data with the trojan trigger, these methods still often fail to identify bugs. (3) We evaluate 7 feature-synthesis methods on our benchmark. (4) We introduce and evaluate 2 new variants of the best-performing method from the previous evaluation. A website for this paper and its code is at https://benchmarking-interpretability.csail.mit.edu/
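
To make the idea of implanting a human-interpretable trojan concrete, here is a minimal sketch of one common approach, data poisoning with a visible patch trigger. It is an illustration under stated assumptions, not necessarily the procedure used in the benchmark: the function name implant_patch_trigger, its parameters, and the choice of a patch-style trigger are all hypothetical. A model trained on the poisoned data would be expected to associate the patch with the target class, so the ground-truth "bug" is known in advance.

```python
import numpy as np


def implant_patch_trigger(images, labels, patch, target_class,
                          poison_frac=0.1, seed=0):
    """Return poisoned copies of (images, labels) and the poisoned indices.

    Hypothetical helper for illustration only.
    images: float array of shape (N, H, W, C)
    labels: int array of shape (N,)
    patch:  float array of shape (h, w, C), the visible trigger
    """
    rng = np.random.default_rng(seed)
    images = images.copy()  # leave the caller's arrays untouched
    labels = labels.copy()
    n, height, width, _ = images.shape
    ph, pw, _ = patch.shape

    # Choose which examples to poison, without replacement.
    n_poison = int(round(poison_frac * n))
    idx = rng.choice(n, size=n_poison, replace=False)

    for i in idx:
        # Paste the patch at a random location that fits inside the image.
        top = rng.integers(0, height - ph + 1)
        left = rng.integers(0, width - pw + 1)
        images[i, top:top + ph, left:left + pw, :] = patch
        # Relabel so that training links the trigger to the target class.
        labels[i] = target_class

    return images, labels, idx
```

An interpretability tool could then be judged by whether its output helps a human identify the patch as the feature that drives predictions of target_class.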

