Causal Analysis for Robust Interpretability of Neural Networks

Interpreting the inner function of neural networks is crucial for the trustworthy development and deployment of these black-box models. Prior interpretability methods focus on correlation-based measures to attribute model decisions to individual examples. However, these measures are susceptible to noise and spurious correlations encoded in the model during the training phase (e.g., biased inputs, model overfitting, or misspecification). Moreover, this process has proven to result in noisy and unstable attributions that prevent any transparent understanding of the model's behavior. In this paper, we develop a robust interventional-based method grounded by causal analysis to capture cause-effect mechanisms in pre-trained neural networks and their relation to the prediction. Our novel approach relies on path interventions to infer the causal mechanisms within hidden layers and isolate relevant and necessary information (to model prediction), avoiding noisy ones. The result is task-specific causal explanatory graphs that can audit model behavior and express the actual causes underlying its performance. We apply our method to vision models trained on classification tasks. On image classification tasks, we provide extensive quantitative experiments to show that our approach can capture more stable and faithful explanations than standard attribution-based methods. Furthermore, the underlying causal graphs reveal the neural interactions in the model, making it a valuable tool in other applications (e.g., model repair).

翻译：理解神经网络的内部功能对于这些黑箱模型的可信开发和部署至关重要。现有的可解释性方法主要依赖基于相关性的度量来将模型决策归因于单个样本。然而，这些度量容易受到训练阶段编码在模型中的噪声和虚假相关性的影响（例如有偏输入、模型过拟合或设定错误）。此外，该过程已被证明会产生噪声大且不稳定的归因，阻碍了对模型行为的透明理解。本文开发了一种基于因果分析的基础性鲁棒干预方法，用于捕捉预训练神经网络中的因果机制及其与预测的关系。我们的新颖方法依赖路径干预来推断隐藏层内的因果机制，并隔离与模型预测相关且必要的信息，避免噪声信息。结果是任务特定的因果解释图，可用于审计模型行为并表达其性能背后的实际原因。我们将该方法应用于在分类任务上训练的视觉模型。在图像分类任务上，我们提供了大量定量实验，表明我们的方法能够比标准基于归因的方法捕获更稳定且更忠实的解释。此外，底层因果图揭示了模型中的神经交互作用，使其成为其他应用（如模型修复）中的宝贵工具。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【CVPR 2022】一个完全无监督的框架，从噪声和部分测量中学习图像，Robust Equivariant Imaging: a fully unsupervised framework for learning to image

专知会员服务

25+阅读 · 2022年3月3日

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

专知会员服务

46+阅读 · 2021年11月24日

UCM《机器学习导论笔记》，80页pdf CSE176 Introduction to Machine Learning

专知会员服务

32+阅读 · 2021年9月29日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

35+阅读 · 2019年10月18日