Deep learning has revolutionized many areas of machine learning, from computer vision to natural language processing, but these high-performance models are generally "black boxes." Explaining such models would improve transparency and trust in AI-powered decision making and is necessary for addressing other practical needs such as robustness and fairness. A popular means of enhancing model transparency is to quantify how individual inputs contribute to model outputs (called attributions) and the magnitude of interactions between groups of inputs. A growing number of these methods import concepts and results from game theory to produce attributions and interactions. This work presents a unifying framework for game-theory-inspired attribution and $k^\text{th}$-order interaction methods. We show that, given modest assumptions, a unique full account of interactions between features, called synergies, is possible in the continuous input setting. We identify how various methods are characterized by their policy of distributing synergies. We also demonstrate that gradient-based methods are characterized by their actions on monomials, a type of synergy function, and introduce unique gradient-based methods. We show that combinations of various criteria uniquely define the attribution/interaction methods. Thus, the community needs to identify goals and contexts when developing and employing attribution and interaction methods.
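To make the idea of "distributing synergies" concrete, the sketch below uses the classical discrete (binary-feature) analogue rather than the continuous-input construction developed in this work. It computes the synergy of each feature subset as a Möbius transform of a set function (the Harsanyi dividends of cooperative game theory) and recovers Shapley attributions by splitting each synergy equally among the features it involves. The toy function `toy_value`, the 0/1 baseline convention, and all helper names are illustrative assumptions, not the paper's definitions.

```python
from itertools import combinations, permutations

def subsets(items):
    """Yield every subset of items as a frozenset."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def synergies(value, players):
    """Moebius transform: synergy(S) = sum over T in S of (-1)^(|S|-|T|) * value(T)."""
    return {
        S: sum((-1) ** (len(S) - len(T)) * value(T) for T in subsets(S))
        for S in subsets(players)
    }

def shapley_from_synergies(syn, players):
    """Distribution policy: split each synergy equally among its member features."""
    return {i: sum(w / len(S) for S, w in syn.items() if i in S) for i in players}

def shapley_by_permutations(value, players):
    """Reference definition: average marginal contribution over all orderings."""
    players = list(players)
    orders = list(permutations(players))
    totals = {i: 0.0 for i in players}
    for order in orders:
        seen = frozenset()
        for i in order:
            totals[i] += value(seen | {i}) - value(seen)
            seen = seen | {i}
    return {i: t / len(orders) for i, t in totals.items()}

# Hypothetical toy model f(x) = x1 + 2*x2*x3, with present features set to 1
# and absent features set to a baseline of 0 (an assumption for illustration).
def toy_value(S):
    x = {i: 1.0 if i in S else 0.0 for i in (1, 2, 3)}
    return x[1] + 2.0 * x[2] * x[3]

players = (1, 2, 3)
syn = synergies(toy_value, players)
phi = shapley_from_synergies(syn, players)
phi_ref = shapley_by_permutations(toy_value, players)
```

In this toy case the only nonzero synergies belong to `{1}` and to the pair `{2, 3}` (the monomial term), and the equal-split policy gives the same attributions as the permutation definition of the Shapley value. A different distribution policy applied to the same synergies would yield a different attribution method.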