Attribution methods reveal which input features a neural network uses for a prediction, making its decisions more transparent. A common problem is that these attributions are unspecific, highlighting important and irrelevant features alike. We revisit the standard attribution pipeline and observe that using logits as the attribution target is a main cause of this phenomenon. We show that the solution is in plain sight: considering distributions of attributions over multiple classes with existing attribution methods yields specific and fine-grained attributions. On common benchmarks, including the grid-pointing game and randomization-based sanity checks, this improves the performance of 18 attribution methods across 7 architectures by up to 2×, independent of the model architecture.
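The shift the abstract describes, targeting class probabilities rather than raw logits, can be illustrated on a toy linear model. The sketch below is a minimal assumption-laden illustration (gradient×input attribution, a random linear classifier), not the paper's exact method: for a logit target the attribution of class c is simply that class's weight row, while for a log-softmax target the probability-weighted mean over all classes is subtracted, removing evidence that is shared across classes.

```python
import numpy as np

# Illustrative sketch (not the paper's implementation): gradient x input
# attribution on a random linear model, comparing a logit target with a
# log-softmax (class-probability) target.

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 8))   # 5 classes, 8 input features
x = rng.normal(size=8)

logits = W @ x
p = np.exp(logits - logits.max())
p /= p.sum()                  # softmax probabilities

c = int(np.argmax(logits))    # class being explained

# Logit target: d logit_c / dx = W[c]; class-shared evidence stays in.
attr_logit = W[c] * x

# Log-softmax target: d log p_c / dx = W[c] - sum_k p_k W[k].
# Subtracting the probability-weighted mean removes features that raise
# every class's logit equally, yielding more class-specific attributions.
attr_prob = (W[c] - p @ W) * x
```

The subtracted term `p @ W` is exactly the contrastive correction that makes the attribution specific to class c rather than to "any confident prediction".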