Class Activation Mapping (CAM) methods are widely used to generate visual explanations for deep learning classifiers in medical imaging. However, existing evaluation frameworks assess whether explanations are correct, as measured by localisation fidelity against radiologist annotations, rather than whether they are consistent, that is, whether the model applies the same spatial reasoning strategy across different patients with the same pathology. We propose the C-Score (Consistency Score), a confidence-weighted, annotation-free metric that quantifies intra-class explanation reproducibility via intensity-emphasised pairwise soft IoU across correctly classified instances. We evaluate six CAM techniques (GradCAM, GradCAM++, LayerCAM, EigenCAM, ScoreCAM, and MS GradCAM++) across three CNN architectures (DenseNet201, InceptionV3, ResNet50V2) over thirty training epochs on the Kermany chest X-ray dataset, covering both transfer learning and fine-tuning phases. We identify three distinct mechanisms of AUC-consistency dissociation that are invisible to standard classification metrics: threshold-mediated gold list collapse, technique-specific attribution collapse at peak AUC, and class-level consistency masking in global aggregation. C-Score also provides an early warning signal of impending model instability: ScoreCAM deterioration on ResNet50V2 is detectable one full checkpoint before catastrophic AUC collapse. Finally, it yields architecture-specific clinical deployment recommendations grounded in explanation quality rather than predictive ranking alone.
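To make the idea of a confidence-weighted, intensity-emphasised pairwise soft IoU concrete, the following is a minimal sketch only. The abstract does not give the exact formula, so every specific here is an assumption for illustration: the function names `soft_iou` and `c_score`, the power-law emphasis exponent `gamma`, the min/max form of soft IoU, and the use of the product of the two prediction confidences as the pair weight.

```python
import itertools
import numpy as np


def soft_iou(a, b, gamma=2.0, eps=1e-8):
    """Soft IoU between two heatmaps with values in [0, 1].

    Assumption: "intensity emphasis" is modelled as raising each map to the
    power `gamma`, which suppresses weak activations relative to strong ones.
    """
    a = np.clip(a, 0.0, 1.0) ** gamma
    b = np.clip(b, 0.0, 1.0) ** gamma
    intersection = np.minimum(a, b).sum()
    union = np.maximum(a, b).sum()
    return float(intersection / (union + eps))


def c_score(heatmaps, confidences, gamma=2.0):
    """Confidence-weighted mean of pairwise soft IoU within one class.

    `heatmaps` holds CAM maps of correctly classified instances of a single
    class; `confidences` holds the matching predicted probabilities.
    Assumption: each pair is weighted by the product of its two confidences.
    Returns NaN when fewer than two maps are available.
    """
    if len(heatmaps) < 2:
        return float("nan")
    weighted_sum = 0.0
    weight_total = 0.0
    for i, j in itertools.combinations(range(len(heatmaps)), 2):
        weight = confidences[i] * confidences[j]
        weighted_sum += weight * soft_iou(heatmaps[i], heatmaps[j], gamma)
        weight_total += weight
    return weighted_sum / weight_total if weight_total > 0 else float("nan")
```

Under these assumptions, a class whose heatmaps all highlight the same region scores close to 1, and a class whose heatmaps never overlap scores 0. No annotation masks are needed, only the heatmaps and the classifier's confidences.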