Attribution methods aim to identify the input pixels that are relevant to a model's decision. A popular family of approaches uses modified backpropagation (BP) rules to trace a decision back to the input, which yields more interpretable visualizations than the raw gradients. However, these methods lack a solid theoretical foundation and exhibit perplexing behaviors, such as reduced sensitivity to parameter randomization, which raises concerns about their reliability and highlights the need for theoretical justification. In this work, we present a unified theoretical framework for methods such as GBP, RectGrad, LRP, and DTD, and demonstrate that they achieve input alignment by combining the weights of activated neurons. This alignment improves visualization quality and reduces sensitivity to weight randomization. Our contributions are: (1) a unified explanation for multiple behaviors, rather than a single one; (2) accurate predictions of novel behaviors; and (3) insights into the decision-making process, including layer-wise information changes and the relationship between attributions and model decisions.
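As a toy illustration of what a modified BP rule looks like, the sketch below contrasts the plain gradient with a GBP-style rule on a two-layer ReLU network f(x) = w2 . relu(W1 x). It is not the framework developed in this work: the network, the random weights, and the function names `plain_gradient` and `guided_backprop` are all assumptions made for the example. In this small setting the GBP-style attribution is a positively weighted sum of the first-layer weight rows of activated neurons, so its inner product with the input cannot be negative.

```python
import numpy as np

def plain_gradient(W1, w2, x):
    """Gradient of f(x) = w2 . relu(W1 @ x) with respect to x."""
    active = (W1 @ x) > 0              # forward ReLU mask
    grad_hidden = w2 * active          # standard ReLU backward rule
    return W1.T @ grad_hidden

def guided_backprop(W1, w2, x):
    """GBP-style rule: at the ReLU, also discard negative backward signals."""
    active = (W1 @ x) > 0
    grad_hidden = w2 * active * (w2 > 0)   # keep positive signals on active units only
    return W1.T @ grad_hidden

# Hypothetical toy network with random weights (assumption, for illustration only).
rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 8))
w2 = rng.normal(size=16)
x = rng.normal(size=8)

grad = plain_gradient(W1, w2, x)
gbp = guided_backprop(W1, w2, x)

# Inner product with the input: unconstrained in sign for the plain gradient,
# non-negative for the GBP-style attribution in this toy setting.
print("grad . x =", float(grad @ x))
print("gbp  . x =", float(gbp @ x))
```

Every row W1[j] that contributes to `gbp` belongs to a neuron with W1[j] . x > 0 and receives a positive coefficient w2[j], which is why the sum is pushed toward the input direction regardless of the sign pattern the plain gradient would have produced.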