Prior work has shown that neural networks can be heavily pruned while preserving performance, but the impact of pruning on model interpretability remains unclear. In this work, we investigate how magnitude-based pruning followed by fine-tuning affects both low-level saliency maps and high-level concept representations. Using a ResNet-18 trained on ImageNette, we compare post-hoc explanations from Vanilla Gradients (VG) and Integrated Gradients (IG) across pruning levels, evaluating their sparsity and faithfulness. We further apply CRAFT-based concept extraction to track changes in the semantic coherence of learned concepts. Our results show that light-to-moderate pruning improves saliency-map focus and faithfulness while retaining distinct, semantically meaningful concepts. In contrast, aggressive pruning merges heterogeneous features, reducing saliency-map sparsity and concept coherence despite maintaining accuracy. These findings suggest that while pruning can shape internal representations toward more human-aligned attention patterns, excessive pruning undermines interpretability.
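To make the pipeline described above concrete, the following is a minimal sketch, not the paper's actual code, of magnitude-based pruning of a ResNet-18 followed by VG and IG saliency computation in PyTorch. The specific choices here (10 output classes for ImageNette, 30% global sparsity, a zero baseline and 50 steps for IG, and a random tensor standing in for an input image) are illustrative assumptions; the fine-tuning stage between pruning and explanation is omitted.

```python
# Hedged sketch: magnitude-based global pruning of a ResNet-18, then Vanilla
# Gradients (VG) and Integrated Gradients (IG) saliency for one input.
# Sparsity level, IG step count, and the zero baseline are assumptions.
import torch
import torch.nn.utils.prune as prune
from torchvision.models import resnet18

model = resnet18(num_classes=10).eval()  # ImageNette has 10 classes

# Magnitude-based pruning: zero out the 30% smallest-magnitude conv weights globally.
conv_params = [(m, "weight") for m in model.modules() if isinstance(m, torch.nn.Conv2d)]
prune.global_unstructured(conv_params, pruning_method=prune.L1Unstructured, amount=0.3)
# (Fine-tuning on the training set would follow here before computing explanations.)

def vanilla_gradients(model, x, target):
    """VG saliency: gradient of the target logit with respect to the input."""
    x = x.clone().requires_grad_(True)
    model(x)[0, target].backward()
    return x.grad.detach()

def integrated_gradients(model, x, target, steps=50):
    """IG saliency: average gradients along a straight path from a zero baseline to x."""
    baseline = torch.zeros_like(x)
    grads = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        xi = (baseline + alpha * (x - baseline)).detach().requires_grad_(True)
        model(xi)[0, target].backward()
        grads.append(xi.grad.detach())
    avg_grad = torch.stack(grads).mean(dim=0)
    return (x - baseline) * avg_grad

x = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed ImageNette image
vg_map = vanilla_gradients(model, x, target=0)
ig_map = integrated_gradients(model, x, target=0)
```

In practice, the resulting attribution maps would then be scored for sparsity (e.g., mass concentration) and faithfulness (e.g., deletion/insertion curves) at each pruning level; those metrics are outside the scope of this sketch.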