Self-interpreting neural networks have attracted significant research interest. Existing work in this area often either (1) lacks a solid theoretical foundation that ensures genuine interpretability or (2) compromises model expressiveness. In response, we formulate a generic Additive Self-Attribution (ASA) framework. Observing that Shapley values are absent from existing additive self-attribution approaches, we propose the Shapley Additive Self-Attributing Neural Network (SASANet), which comes with theoretical guarantees that its self-attribution values equal the Shapley values of its output. Specifically, SASANet uses a marginal contribution-based sequential schema and internal distillation-based training strategies to model meaningful outputs for any number of features, resulting in an unapproximated, meaningful value function. Our experimental results indicate that SASANet surpasses existing self-attributing models in performance and rivals black-box models. Moreover, SASANet is shown to be more precise and efficient than post-hoc methods in interpreting its own predictions.
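To make the central quantity concrete, the following minimal sketch computes exact Shapley values for a small value function by averaging each feature's marginal contribution over all feature orderings. It illustrates only the standard Shapley definition and the additive property that attributions sum to the output; it is not SASANet's architecture or training procedure. The names `shapley_values` and `toy_value`, and the toy feature scores, are hypothetical and chosen purely for illustration.

```python
from itertools import permutations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values: average marginal contribution over all orderings.

    value_fn maps a frozenset of feature indices to a real-valued output.
    Cost grows as n_features!, so this is only practical for tiny examples.
    """
    totals = [0.0] * n_features
    for order in permutations(range(n_features)):
        present = frozenset()
        prev = value_fn(present)
        for i in order:
            # Add feature i to the features seen so far and record the change.
            present = present | {i}
            cur = value_fn(present)
            totals[i] += cur - prev
            prev = cur
    n_orders = factorial(n_features)
    return [t / n_orders for t in totals]


def toy_value(subset):
    """Hypothetical value function (an assumption, not taken from the paper).

    Each feature adds a fixed score, and features 0 and 1 interact.
    """
    scores = {0: 1.0, 1: 2.0, 2: 3.0}
    out = sum(scores[i] for i in subset)
    if 0 in subset and 1 in subset:
        out += 1.5  # interaction term shared between features 0 and 1
    return out


phi = shapley_values(toy_value, 3)
full = toy_value(frozenset({0, 1, 2}))
empty = toy_value(frozenset())
print(phi, sum(phi), full - empty)
```

In this toy case the interaction term is split evenly between features 0 and 1, and the attributions sum to the difference between the full-input output and the empty-input output. Exhaustive enumeration is used here only because the example has three features.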