NeuroInspect: Interpretable Neuron-based Debugging Framework through Class-conditional Visualizations

Although deep learning (DL) has achieved remarkable progress in various domains, DL models remain prone to making mistakes. This issue calls for effective debugging tools that help DL practitioners interpret the decision-making process within networks. However, existing debugging methods often demand extra data or adjustments to the decision process, which limits their applicability. To tackle this problem, we present NeuroInspect, an interpretable neuron-based debugging framework with three key stages: counterfactual explanations, feature visualizations, and false correlation mitigation. Our framework first pinpoints the neurons responsible for mistakes in the network and then visualizes the features embedded in those neurons in a human-interpretable form. To provide these explanations, we introduce CLIP-Illusion, a novel feature visualization method that generates class-conditioned images of features to examine the connection between neurons and the decision layer. By employing class information, CLIP-Illusion isolates mixed properties and thereby alleviates the convoluted explanations produced by conventional visualization approaches. This process offers more human-interpretable explanations of model errors without altering the trained network or requiring additional data. Furthermore, our framework mitigates false correlations learned from a dataset from a stochastic perspective, adjusting the decisions associated with the neurons identified as the main causes. We validate the effectiveness of our framework by addressing false correlations and improving inference on the worst-performing classes in real-world settings. Moreover, we demonstrate through a human evaluation of interpretability that NeuroInspect helps debug the mistakes of DL models. The code is openly available at https://github.com/yeongjoonJu/NeuroInspect.
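The abstract does not spell out how the first stage scores neurons. As a rough illustration only, and assuming the network ends in a linear decision layer applied to penultimate-layer activations, a misclassification can be attributed to individual neurons by decomposing the logit margin between the predicted and the true class. The helper name `neuron_blame` is hypothetical, and this sketch is not NeuroInspect's counterfactual-explanation procedure.

```python
from typing import List, Sequence, Tuple


def neuron_blame(
    activations: Sequence[float],
    weights: Sequence[Sequence[float]],
    true_class: int,
    pred_class: int,
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Hypothetical helper: rank penultimate-layer neurons by how much they
    push the logit of pred_class above the logit of true_class.

    Assumes a linear decision layer: logit[c] = sum_i weights[c][i] * a[i] + b[c].
    """
    # For a linear layer, the margin logit[pred] - logit[true] splits into one
    # additive term per neuron. The bias difference b[pred] - b[true] is a
    # constant that no single neuron owns, so it is left out.
    contributions = [
        (i, a * (weights[pred_class][i] - weights[true_class][i]))
        for i, a in enumerate(activations)
    ]
    # The largest positive terms mark the neurons that most favor the wrong class.
    contributions.sort(key=lambda item: item[1], reverse=True)
    return contributions[:top_k]


if __name__ == "__main__":
    # Toy example: 3 neurons, 2 classes; true class 0, predicted class 1.
    acts = [1.0, 2.0, 4.0]
    w = [[2, 0, 1], [1, 3, 1]]
    print(neuron_blame(acts, w, true_class=0, pred_class=1, top_k=2))
    # -> [(1, 6.0), (2, 0.0)]
```

In the toy example, neuron 1 contributes 2.0 × (3 − 0) = 6.0 to the wrong-class margin, so it would be the first candidate to visualize. A class-conditional view like this only reflects the decision layer's linear weights. Explaining *what* such a neuron encodes is the role the abstract assigns to the feature-visualization stage.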
