To explain predictions made by complex machine learning models, many feature attribution methods have been developed that assign importance scores to input features. Some recent work challenges the robustness of these methods by showing that they are sensitive to input and model perturbations, while other work addresses this issue by proposing robust attribution methods. However, previous work on attribution robustness has focused primarily on gradient-based feature attributions, whereas the robustness of removal-based attribution methods is not currently well understood. To bridge this gap, we theoretically characterize the robustness properties of removal-based feature attributions. Specifically, we provide a unified analysis of such methods and derive upper bounds for the difference between intact and perturbed attributions, under settings of both input and model perturbations. Our empirical results on synthetic and real-world data validate our theoretical results and demonstrate their practical implications, including the ability to increase attribution robustness by improving the model's Lipschitz regularity.
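As a rough illustration of the link between Lipschitz regularity and attribution robustness under input perturbation, the sketch below uses a toy setup that is not taken from the paper: a hypothetical model f(x) = tanh(w · x), which is ||w||_2-Lipschitz, and a simple leave-one-out (occlusion) attribution with an assumed zero baseline. For any L-Lipschitz f, each leave-one-out attribution changes by at most 2L||δ||_2 under an input perturbation δ, so the attribution vector changes by at most 2L√d ||δ||_2. This elementary bound is only meant to convey the flavor of the results; it is not the bound derived in this work, and all function names are illustrative.

```python
import numpy as np

def occlusion_attributions(f, x, baseline):
    """Leave-one-out attribution: a_i = f(x) - f(x with feature i set to baseline_i)."""
    full = f(x)
    attributions = np.empty_like(x, dtype=float)
    for i in range(x.shape[0]):
        x_removed = x.copy()
        x_removed[i] = baseline[i]  # "remove" feature i by replacing it with the baseline
        attributions[i] = full - f(x_removed)
    return attributions

def make_model(w):
    # Hypothetical toy model f(x) = tanh(w . x).
    # tanh is 1-Lipschitz, so f is ||w||_2-Lipschitz in the Euclidean norm.
    return lambda x: float(np.tanh(w @ x))

def attribution_gap_and_bound(w, x, delta, baseline):
    """Return (||a(x + delta) - a(x)||_2, 2 * L * sqrt(d) * ||delta||_2) with L = ||w||_2."""
    f = make_model(w)
    gap = np.linalg.norm(
        occlusion_attributions(f, x + delta, baseline)
        - occlusion_attributions(f, x, baseline)
    )
    lipschitz = np.linalg.norm(w)
    bound = 2.0 * lipschitz * np.sqrt(x.shape[0]) * np.linalg.norm(delta)
    return gap, bound

# Small demo on random data (assumed values, for illustration only).
rng = np.random.default_rng(0)
d = 5
w = rng.normal(size=d)
x = rng.normal(size=d)
delta = 0.1 * rng.normal(size=d)
baseline = np.zeros(d)

for scale in (1.0, 0.5):
    gap, bound = attribution_gap_and_bound(scale * w, x, delta, baseline)
    print(f"weight scale={scale}: attribution gap={gap:.4f}, upper bound={bound:.4f}")
```

Halving the weights halves the Lipschitz constant and therefore the upper bound. The observed gap is guaranteed only to stay below the bound; it need not shrink by the same factor.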