Explanation methods for machine learning models typically provide no formal guarantees and may not reflect the model's underlying decision-making process. In this work, we analyze stability as a property of reliable feature attribution methods. We prove that relaxed variants of stability are guaranteed if the model is sufficiently Lipschitz with respect to the masking of features. To obtain such a model, we develop a smoothing method called Multiplicative Smoothing (MuS). We show that MuS overcomes theoretical limitations of standard smoothing techniques and can be integrated with any classifier and any feature attribution method. We evaluate MuS on vision and language models with a variety of feature attribution methods, such as LIME and SHAP, and demonstrate that MuS endows feature attributions with non-trivial stability guarantees.
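To make "smoothing with respect to the masking of features" concrete, the sketch below shows one plausible form of multiplicative smoothing. It is a minimal sketch, not the authors' implementation. It assumes that a feature attribution is a binary mask `alpha`, that masking a feature means zeroing it, and that the smoothing noise `s` drops each feature independently and keeps it with probability `lam`. The abstract does not specify the noise distribution, so the independent Bernoulli choice is an assumption. The names `mus_smooth`, `h`, and `lam` are hypothetical. The smoothed classifier averages `h(x * (alpha * s))` over random masks `s`. Because this is a Monte Carlo estimate, it only approximates the expectation and does not by itself certify any guarantee.

```python
import random
from typing import Callable, List, Sequence


def mus_smooth(
    h: Callable[[List[float]], List[float]],
    x: Sequence[float],
    alpha: Sequence[int],
    lam: float,
    num_samples: int = 1000,
    seed: int = 0,
) -> List[float]:
    """Monte Carlo estimate of E_s[h(x * (alpha * s))].

    Assumption (illustrative only): s_i ~ Bernoulli(lam) i.i.d., and a
    masked feature is replaced by 0. h maps an input to class scores.
    """
    if not 0.0 < lam <= 1.0:
        raise ValueError("lam must be in (0, 1]")
    if num_samples < 1:
        raise ValueError("num_samples must be positive")
    rng = random.Random(seed)
    total: List[float] = []
    for _ in range(num_samples):
        # Random multiplicative mask: keep each feature with probability lam.
        s = [1 if rng.random() < lam else 0 for _ in x]
        # Apply the attribution mask and the noise mask together (elementwise).
        masked = [xi * ai * si for xi, ai, si in zip(x, alpha, s)]
        out = h(masked)
        if not total:
            total = [0.0] * len(out)
        for k, v in enumerate(out):
            total[k] += v
    # Average the class scores over all sampled masks.
    return [t / num_samples for t in total]
```

As a usage example, with a toy classifier `h(z) = [1, 0] if sum(z) > 1.5 else [0, 1]` and `x = [1.0, 1.0, 1.0]`, setting `lam=1.0` recovers `h` on the attribution-masked input exactly. Smaller values of `lam` average over more heavily masked inputs. Under this assumed noise, that averaging is what limits how much the output can change when the attribution mask changes.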