Interpreting the inner workings of neural networks is crucial for the trustworthy development and deployment of these black-box models. Prior interpretability methods rely on correlation-based measures to attribute model decisions to individual examples. However, these measures are susceptible to noise and spurious correlations encoded in the model during training (e.g., biased inputs, model overfitting, or misspecification). Moreover, the resulting attributions have proven noisy and unstable, which prevents a transparent understanding of the model's behavior. In this paper, we develop a robust intervention-based method, grounded in causal analysis, that captures cause-effect mechanisms in pre-trained neural networks and their relation to the prediction. Our approach relies on path interventions to infer the causal mechanisms within hidden layers and to isolate the information that is relevant and necessary to the model's prediction while discarding noise. The result is a set of task-specific causal explanatory graphs that can be used to audit model behavior and expose the actual causes underlying its performance. We apply our method to vision models trained on image classification tasks and provide extensive quantitative experiments showing that our approach yields more stable and faithful explanations than standard attribution-based methods. Furthermore, the underlying causal graphs reveal the neural interactions in the model, making the method a valuable tool for other applications (e.g., model repair).
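To make the idea of an intervention on a hidden layer concrete, the following is a minimal sketch and not the method of the paper. It uses a toy two-layer ReLU network with hand-picked weights, forces one hidden unit at a time to zero, and records how much the output changes. The names `forward`, `intervention_effect`, `W1`, and `W2` are hypothetical, and forcing a single unit to zero is an assumption standing in for the path interventions described above.

```python
# Illustrative sketch only: a toy network with hand-picked weights.
# All names here are hypothetical and not taken from the paper.

W1 = [[1.0, -1.0], [0.5, 0.5], [0.0, 0.0]]  # 3 hidden units, 2 inputs
W2 = [2.0, -1.0, 3.0]                        # 1 output, 3 hidden units


def relu(v):
    return max(0.0, v)


def forward(x, do=None):
    """Run the network; `do` maps a hidden-unit index to a forced value."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    if do:
        for idx, value in do.items():
            hidden[idx] = value  # intervention: overwrite the activation
    return sum(w * h for w, h in zip(W2, hidden))


def intervention_effect(x, unit, value=0.0):
    """Output change when `unit` is forced to `value` (assumed baseline: 0)."""
    return forward(x) - forward(x, do={unit: value})


x = [2.0, 1.0]
effects = [intervention_effect(x, j) for j in range(len(W2))]
print(effects)  # one effect per hidden unit
```

In this toy setting the third hidden unit has the largest outgoing weight, yet intervening on it leaves the output unchanged because the unit is inactive for this input. This is the kind of distinction between an apparent association and an actual effect on the prediction that an interventional analysis is meant to surface.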