Interpretable AI tools are often motivated by the goal of understanding model behavior in out-of-distribution (OOD) contexts. Despite the attention this area of study receives, there are comparatively few cases where these tools have identified novel, previously unknown bugs in models. We argue that this is due, in part, to a common feature of many interpretability methods: they analyze and explain the behavior of a model using a particular dataset. While this is useful, such tools can only analyze behaviors induced by features that the user can sample or identify in advance. To address this, a growing body of research interprets models using feature synthesis methods, which do not depend on a dataset. In this paper, our primary contribution is a benchmark for evaluating interpretability tools. Our key insight is that we can train models that respond to specific triggers (e.g., a specific patch inserted into an image) with specific outputs (i.e., a label) and then evaluate interpretability tools based on whether they help humans identify these triggers. We make four contributions. (1) We propose trojan discovery as an evaluation task for interpretability tools and introduce a trojan-discovery benchmark with 12 trojans of 3 different types. (2) We demonstrate the difficulty of this benchmark with a preliminary evaluation of 16 feature attribution/saliency tools. Even with access to data containing a trojan's trigger, these methods regularly fail to identify bugs. (3) We evaluate 7 feature-synthesis methods on our benchmark. (4) We introduce and evaluate 2 variants of the best-performing method from the previous evaluation.
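To make the trigger-and-label idea concrete, the following is a minimal sketch of how a patch trojan could be implanted through data poisoning. It is an illustration under stated assumptions, not the benchmark's actual pipeline: the function names `insert_patch` and `poison_dataset`, the random patch placement, and the `fraction` parameter are all hypothetical choices made for this example.

```python
import numpy as np


def insert_patch(image, patch, top, left):
    """Return a copy of `image` (H, W, C) with `patch` (h, w, C) pasted at (top, left)."""
    h, w = patch.shape[:2]
    if top < 0 or left < 0 or top + h > image.shape[0] or left + w > image.shape[1]:
        raise ValueError("patch does not fit inside the image at this position")
    out = image.copy()
    out[top:top + h, left:left + w] = patch
    return out


def poison_dataset(images, labels, patch, target_label, fraction, seed=0):
    """Paste `patch` into a random subset of `images` (N, H, W, C) and relabel them.

    Hypothetical helper: a model trained on the result may learn to map the
    patch (the trigger) to `target_label` (the trojan's output).
    Returns the poisoned images, the poisoned labels, and the sorted indices
    of the examples that were modified.
    """
    rng = np.random.default_rng(seed)
    n = len(images)
    n_poison = int(round(fraction * n))
    idx = rng.choice(n, size=n_poison, replace=False)

    images = images.copy()
    labels = labels.copy()
    h, w = patch.shape[:2]
    for i in idx:
        # Assumption: the trigger is placed at a uniformly random valid location.
        top = rng.integers(0, images.shape[1] - h + 1)
        left = rng.integers(0, images.shape[2] - w + 1)
        images[i] = insert_patch(images[i], patch, top, left)
        labels[i] = target_label
    return images, labels, np.sort(idx)
```

Under this setup, an interpretability tool would be judged by whether its output helps a human recover what the patch looks like without being shown the poisoned examples.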