Interpretable machine learning offers insights into which factors drive a given prediction of a black-box system. Many interpretation methods focus on identifying explanatory input features and generally fall into two main categories: attribution and selection. A popular attribution-based approach exploits local neighborhoods to learn instance-specific explainers in an additive manner. Because a separate explainer must be fitted for each instance, the process is inefficient and susceptible to poorly conditioned samples. Meanwhile, many selection-based methods directly optimize local feature distributions in an instance-wise training framework and can therefore leverage global information from other inputs. However, they can only interpret single-class predictions, and many suffer from inconsistency across different settings because they rely strictly on a pre-defined number of selected features. This work combines the strengths of both approaches and proposes a framework for learning local explanations simultaneously for multiple target classes. Our model explainer significantly outperforms additive and instance-wise counterparts on faithfulness while producing more compact and comprehensible explanations. Through extensive experiments on various data sets and black-box model architectures, we also demonstrate its capacity to select stable and important features.
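To make the attribution-based baseline mentioned in the abstract concrete, the following is a minimal sketch of an instance-specific additive explainer fitted on a local neighborhood. It is not the framework proposed in this work. The function names `local_additive_explainer` and `select_top_k`, the Gaussian perturbation scheme, and the exponential proximity kernel are all illustrative assumptions. The second function shows the kind of fixed-size selection that depends on a pre-defined number of features `k`.

```python
import numpy as np

def local_additive_explainer(predict_fn, x, n_samples=500, sigma=0.5, seed=0):
    """Fit a weighted linear surrogate around x and return per-feature weights.

    Hypothetical helper: predict_fn maps an (n, d) array to an (n,) array of
    black-box scores for one target class.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    # Assumed neighborhood: Gaussian perturbations centered on the instance.
    Z = x + sigma * rng.standard_normal((n_samples, d))
    y = predict_fn(Z)
    # Assumed proximity kernel: closer samples get larger weights.
    dist2 = np.sum((Z - x) ** 2, axis=1)
    w = np.exp(-dist2 / (2.0 * sigma ** 2 * d))
    # Weighted least squares with an intercept column.
    A = np.hstack([np.ones((n_samples, 1)), Z - x])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[1:]  # drop the intercept; one additive weight per feature

def select_top_k(attributions, k):
    """Return the indices of the k features with the largest absolute weight."""
    return np.argsort(-np.abs(attributions))[:k]

if __name__ == "__main__":
    # Toy black box that depends only on features 0 and 2.
    black_box = lambda Z: 3.0 * Z[:, 0] - 2.0 * Z[:, 2]
    x = np.array([1.0, -0.5, 2.0, 0.3])
    attributions = local_additive_explainer(black_box, x)
    print(attributions, select_top_k(attributions, k=2))
```

The surrogate has to be refitted for every instance, which is the source of the inefficiency noted above, and the output of `select_top_k` changes with the choice of `k`.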