Trusting Right Predictions for Wrong Reasons: A LIME-Based Analysis of Deep Learning Interpretability in Lung Cancer Diagnosis

Lung cancer is the leading cause of cancer-related mortality, with approximately 2.5 million new cases and 1.8 million deaths annually, making reliable diagnosis a clinical priority. Although deep learning models have achieved strong performance in lung cancer classification, evaluation has largely focused on predictive accuracy, leaving their decision-making processes insufficiently examined. This study compares three architecturally distinct models: a Convolutional Neural Network (CNN), a pretrained ResNet50, and a Vision Transformer (ViT), trained on the IQ-OTH/NCCD lung cancer CT dataset. Local Interpretable Model-Agnostic Explanations (LIME) were applied to investigate model reasoning. In addition to standard performance metrics, a dual-correlation framework was introduced to measure both prediction agreement and explanation agreement across model pairs. All three models achieved strong classification performance, with ResNet50 attaining 98.61% accuracy, CNN 97.91%, and ViT 93.75%, while all achieved ROC-AUC scores of 0.99. Prediction correlations exceeded 0.99 across all model pairs, indicating highly consistent outputs. However, LIME explanation correlations remained below 0.26, revealing substantial differences in the image regions used to reach those predictions. Analysis of misclassified samples further identified a consistent spatial pattern: incorrect predictions were associated with attention outside the lung parenchyma, whereas correct predictions focused primarily within lung regions. These findings demonstrate that prediction agreement is a poor proxy for reasoning consistency, and that interpretability evaluation must be treated as an independent validation criterion alongside predictive performance in clinical AI systems.
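The dual-correlation idea can be illustrated with a minimal sketch. The abstract does not state which correlation coefficient was used or how LIME outputs were compared, so the code below assumes Pearson correlation, applied once to the class-probability outputs of each model pair (prediction agreement) and once per image to flattened attribution maps, then averaged (explanation agreement). The names `pearson` and `dual_correlation`, the array shapes, and the synthetic data are hypothetical and are not taken from the study.

```python
from itertools import combinations

import numpy as np


def pearson(a, b):
    """Pearson correlation between two arrays after flattening (assumed metric)."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    # A constant array has no defined correlation.
    return float((a * b).sum() / denom) if denom > 0 else float("nan")


def dual_correlation(probs, maps):
    """Return {(model_a, model_b): (prediction_r, explanation_r)}.

    probs: dict of model name -> array (n_samples, n_classes) of class probabilities
    maps:  dict of model name -> array (n_samples, H, W) of per-pixel attribution
           weights, e.g. LIME superpixel weights painted back onto the image grid
    """
    results = {}
    for m1, m2 in combinations(sorted(probs), 2):
        # Prediction agreement: correlate the full probability outputs.
        pred_r = pearson(probs[m1], probs[m2])
        # Explanation agreement: correlate maps image by image, then average.
        expl_r = float(np.mean([pearson(x, y) for x, y in zip(maps[m1], maps[m2])]))
        results[(m1, m2)] = (pred_r, expl_r)
    return results


# Synthetic stand-in data: similar predictions, unrelated attribution maps.
rng = np.random.default_rng(0)
base = rng.dirichlet(np.ones(3), size=20)
names = ["cnn", "resnet50", "vit"]
probs = {n: base + rng.normal(0.0, 0.01, base.shape) for n in names}
maps = {n: rng.normal(size=(20, 16, 16)) for n in names}

for pair, (pred_r, expl_r) in dual_correlation(probs, maps).items():
    print(pair, f"prediction r = {pred_r:.3f}", f"explanation r = {expl_r:.3f}")
```

On data like this the two numbers diverge by construction, which mirrors the pattern the abstract reports: models can agree almost perfectly on outputs while their attribution maps barely correlate.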
