Interpreting the inner workings of neural networks is crucial for the trustworthy development and deployment of these black-box models. Prior interpretability methods rely on correlation-based measures to attribute model decisions to individual examples. However, these measures are susceptible to noise and spurious correlations that the model encodes during training (e.g., from biased inputs, overfitting, or misspecification). As a result, the attributions they produce have been shown to be noisy and unstable, which prevents a transparent understanding of the model's behavior. In this paper, we develop a robust intervention-based method, grounded in causal analysis, that captures cause-effect mechanisms in pre-trained neural networks and their relation to the prediction. Our approach relies on path interventions to infer the causal mechanisms within hidden layers and to isolate the information that is relevant and necessary to the model's prediction while discarding noise. The result is a set of task-specific causal explanatory graphs that can be used to audit model behavior and that express the actual causes underlying its performance. We apply our method to vision models trained on image classification tasks and provide extensive quantitative experiments showing that it yields more stable and faithful explanations than standard attribution-based methods. Furthermore, the underlying causal graphs reveal the neural interactions in the model, making the method a valuable tool for other applications (e.g., model repair).