Interpretable AI tools are often motivated by the goal of understanding model behavior in out-of-distribution (OOD) contexts. Despite the attention this area of study receives, there are comparatively few cases where these tools have identified previously unknown bugs in models. We argue that this is due, in part, to a common feature of many interpretability methods: they analyze model behavior by using a particular dataset. This only allows for the study of the model in the context of features that the user can sample in advance. To address this, a growing body of research involves interpreting models using \emph{feature synthesis} methods that do not depend on a dataset. In this paper, we benchmark the usefulness of interpretability tools on debugging tasks. Our key insight is that we can implant human-interpretable trojans into models and then evaluate these tools based on whether they can help humans discover them. This is analogous to finding OOD bugs, except the ground truth is known, allowing us to know when an interpretation is correct. We make four contributions. (1) We propose trojan discovery as an evaluation task for interpretability tools and introduce a benchmark with 12 trojans of 3 different types. (2) We demonstrate the difficulty of this benchmark with a preliminary evaluation of 16 state-of-the-art feature attribution/saliency tools. Even under ideal conditions, given direct access to data with the trojan trigger, these methods still often fail to identify bugs. (3) We evaluate 7 feature-synthesis methods on our benchmark. (4) We introduce and evaluate 2 new variants of the best-performing method from the previous evaluation. A website for this paper and its code is at https://benchmarking-interpretability.csail.mit.edu/