To explain complex models based on their inputs, many feature attribution methods have been developed that assign importance scores to input features. However, some recent work challenges the robustness of feature attributions by showing that these methods are sensitive to input and model perturbations, while other work addresses this issue by proposing robust attribution methods and model modifications. Nevertheless, previous work on attribution robustness has focused primarily on gradient-based feature attributions. In contrast, the robustness properties of removal-based attribution methods are not well understood. To bridge this gap, we theoretically characterize the robustness of removal-based feature attributions. Specifically, we provide a unified analysis of such methods and prove upper bounds for the difference between intact and perturbed attributions, under both input and model perturbations. Our experiments on synthetic and real-world data validate our theoretical results and demonstrate their practical implications.
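
To make the notion of "difference between intact and perturbed attributions" concrete, the following is a minimal sketch of one simple removal-based method (leave-one-out with baseline replacement) applied to a toy linear model. The model, its weights, the all-zeros baseline, and the function names are all assumptions chosen for illustration; they are not taken from the paper. For this particular linear model, the attribution for feature i is w_i (x_i - baseline_i), so the L2 gap between attributions is at most max|w_i| times the L2 gap between inputs. This is a property of the toy setup only and should not be read as the paper's bound.

```python
import math


def occlusion_attributions(model, x, baseline):
    """Leave-one-out attributions: the change in model output when
    feature i is replaced by its baseline value (one simple form of
    feature removal)."""
    full_output = model(x)
    scores = []
    for i in range(len(x)):
        x_removed = list(x)
        x_removed[i] = baseline[i]  # "remove" feature i
        scores.append(full_output - model(x_removed))
    return scores


def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical toy model: a linear function with fixed weights.
WEIGHTS = [2.0, -1.0, 0.5]


def linear_model(x):
    return sum(w * xi for w, xi in zip(WEIGHTS, x))


x = [1.0, 2.0, 3.0]             # intact input
x_perturbed = [1.1, 1.9, 3.05]  # slightly perturbed input
baseline = [0.0, 0.0, 0.0]      # assumed removal baseline

attr = occlusion_attributions(linear_model, x, baseline)
attr_perturbed = occlusion_attributions(linear_model, x_perturbed, baseline)

attribution_gap = l2_distance(attr, attr_perturbed)
input_gap = l2_distance(x, x_perturbed)

# For this linear toy model, the attribution gap cannot exceed
# max|w_i| * input_gap.
toy_bound = max(abs(w) for w in WEIGHTS) * input_gap
print(attribution_gap <= toy_bound)
```

Swapping `linear_model` for a non-linear function leaves `occlusion_attributions` unchanged, but the simple `toy_bound` above no longer applies in general.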