Attribution methods provide insight into the decision-making process of machine learning models, especially deep neural networks, by assigning a contribution score to each individual feature. However, the attribution problem has not been well defined, so there is no unified guideline for the contribution assignment process. Furthermore, existing attribution methods are often built upon various empirical intuitions and heuristics. A general theoretical framework is still lacking that can not only describe the attribution problem well but also unify and revisit existing attribution methods. To bridge this gap, we propose a Taylor attribution framework, which models the attribution problem as deciding individual payoffs in a coalition. We then reformulate fourteen mainstream attribution methods into the Taylor framework and analyze them in terms of rationale, fidelity, and limitations. Moreover, we establish three principles for a good attribution in the Taylor attribution framework: low approximation error, correct Taylor contribution assignment, and unbiased baseline selection. Finally, we empirically validate the Taylor reformulations and, by benchmarking on real-world datasets, reveal a positive correlation between attribution performance and the number of principles an attribution method follows.