In the supervised classification setting, deep networks typically produce multiple predictions at inference time. For a pair of such predictions (both within the top-k), two distinct possibilities arise. On the one hand, the two predictions may each be driven primarily by a distinct set of entities in the input. On the other hand, a single entity (or set of entities) may drive the predictions for both classes. The latter case, in effect, corresponds to the network making two separate guesses about the identity of a single entity type. Clearly, both guesses cannot be true, i.e., both labels cannot be present in the input. Current interpretability techniques do not readily disambiguate these two cases, since they typically consider input attributions for one class label at a time. Here, we present a framework and method to do so, leveraging modern segmentation and input-attribution techniques. Notably, our framework also provides a simple counterfactual "proof" for each case, which can be verified on the model for the given input (i.e., without running the method again). We demonstrate that the method performs well on a number of samples from the ImageNet validation set and across multiple models.
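The core disambiguation step described above can be illustrated with a minimal sketch: given one attribution map per class (e.g. from Grad-CAM or Integrated Gradients), threshold each map to its most salient region and measure their overlap. High overlap suggests a single entity drives both predictions; low overlap suggests distinct entities. The function name, the quantile threshold, and the IoU decision rule below are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def attribution_overlap(attr_a, attr_b, quantile=0.95):
    """Estimate whether two per-class attribution maps point at the same
    input region. Returns the IoU of the top-salience masks: values near 1
    suggest a single shared entity, values near 0 suggest distinct entities.
    Assumption: attr_a/attr_b are non-negative 2-D attribution maps."""
    mask_a = attr_a >= np.quantile(attr_a, quantile)
    mask_b = attr_b >= np.quantile(attr_b, quantile)
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / max(union, 1)

# Toy maps: evidence concentrated on the same region vs. disjoint regions.
shared = np.zeros((8, 8)); shared[2:4, 2:4] = 1.0
distinct = np.zeros((8, 8)); distinct[5:7, 5:7] = 1.0
print(attribution_overlap(shared, shared))    # shared evidence  -> 1.0
print(attribution_overlap(shared, distinct))  # disjoint evidence -> 0.0
```

In the shared-evidence case, the counterfactual "proof" mentioned in the abstract could then take the form of masking the single shared region and checking that both predictions drop, whereas in the distinct-evidence case masking one region should affect only its corresponding class.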