Interpretability methods are valuable only if their explanations faithfully describe the explained model. In this work, we consider neural networks whose predictions are invariant under a specific symmetry group. This includes popular architectures, ranging from convolutional to graph neural networks. Any explanation that faithfully explains this type of model needs to agree with this invariance property. We formalize this intuition through the notions of explanation invariance and equivariance, leveraging the formalism of geometric deep learning. Through this formalism, we derive (1) two metrics to measure the robustness of any interpretability method with respect to the model symmetry group; (2) theoretical robustness guarantees for some popular interpretability methods; and (3) a systematic approach to increase the invariance of any interpretability method with respect to a symmetry group. By empirically measuring our metrics for explanations of models associated with various modalities and symmetry groups, we derive a set of five guidelines that allow users and developers of interpretability methods to produce robust explanations.