We investigate the problem of explainability for machine learning models, focusing on Feature Attribution Methods (FAMs) that evaluate feature importance through perturbation tests. Despite their utility, FAMs struggle to distinguish the contributions of different features when the prediction changes after perturbation are similar. To enhance the discriminative power of FAMs, we introduce Feature Attribution with Necessity and Sufficiency (FANS), which finds a neighborhood of the input such that perturbing samples within this neighborhood has a high Probability of being a Necessity and Sufficiency (PNS) cause of the change in predictions, and uses this PNS as the importance of the feature. Specifically, FANS computes this PNS via a heuristic strategy for estimating the neighborhood and a two-stage (factual and interventional) perturbation test for counterfactual reasoning. To generate counterfactual samples, we use a resampling-based approach on the observed samples to approximate the required conditional distribution. We demonstrate that FANS outperforms existing attribution methods on six benchmarks. The source code is available at \url{https://github.com/DMIRLAB-Group/FANS}.
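The resampling step mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation; the function `resample_counterfactuals`, its parameters (`radius`, `n_samples`), and the neighborhood-matching heuristic are hypothetical stand-ins assumed for illustration: counterfactual values for a perturbed feature subset are drawn from observed samples whose remaining features lie close to the input, approximating the required conditional distribution.

```python
import numpy as np

def resample_counterfactuals(X_obs, x, feature_idx, radius=1.0,
                             n_samples=100, rng=None):
    """Generate counterfactuals for `x` by resampling the features in
    `feature_idx` from observed samples whose remaining features lie within
    `radius` of `x` -- a simple neighborhood-matching approximation of the
    conditional distribution of the perturbed features given the rest.
    (Illustrative sketch only, not the FANS implementation.)"""
    rng = np.random.default_rng(rng)
    keep = np.setdiff1d(np.arange(X_obs.shape[1]), feature_idx)
    # Neighborhood: observed samples close to x on the non-perturbed features.
    dist = np.linalg.norm(X_obs[:, keep] - x[keep], axis=1)
    pool = X_obs[dist <= radius]
    if len(pool) == 0:          # fall back to all observed samples
        pool = X_obs
    draws = pool[rng.integers(0, len(pool), size=n_samples)]
    cf = np.tile(x, (n_samples, 1))
    cf[:, feature_idx] = draws[:, feature_idx]  # swap in resampled values
    return cf
```

The counterfactuals keep the non-perturbed coordinates of `x` fixed, so a downstream perturbation test can attribute any prediction change to the resampled feature subset alone.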