Deep learning models have transformed medical image analysis and hold considerable promise for improved diagnostics and patient care. However, their reported performance can be misleadingly optimistic because of a hidden pitfall known as data leakage. In this study, we investigate data leakage in 3D medical imaging, specifically in 3D Convolutional Neural Networks (CNNs) applied to brain MRI analysis. Although 3D CNNs appear less prone to leakage than their 2D counterparts, improper data splitting during cross-validation (CV) can still cause leakage, especially with longitudinal imaging data that contain repeated scans from the same subject. We examine how different data splitting strategies affect model performance in longitudinal brain MRI analysis and identify potential sources of data leakage. GradCAM visualization helps reveal shortcuts learned by CNN models as a result of identity confounding, in which the model learns to identify subjects in addition to diagnostic features. Our findings, consistent with prior research, underscore the importance of subject-wise splitting and of further evaluating models on hold-out data from different subjects to ensure the integrity and reliability of deep learning models in medical image analysis.
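To make the distinction between subject-wise and scan-wise splitting concrete, the following is a minimal sketch in plain Python. It is not the pipeline used in this study: the function names (`subject_wise_folds`, `shared_subjects`), the toy cohort, and the subject IDs are all hypothetical and serve only to illustrate the idea that every scan from a given subject must fall on the same side of a split.

```python
import random
from collections import defaultdict


def subject_wise_folds(subject_ids, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs in which no subject spans both sides.

    subject_ids[i] is the (hypothetical) subject identifier of scan i.
    """
    # Group scan indices by subject so all scans of a subject move together.
    scans_by_subject = defaultdict(list)
    for idx, sid in enumerate(subject_ids):
        scans_by_subject[sid].append(idx)

    subjects = sorted(scans_by_subject)
    if n_splits < 2 or n_splits > len(subjects):
        raise ValueError("n_splits must be between 2 and the number of subjects")

    # Shuffle subjects (not scans), then deal them round-robin into folds.
    random.Random(seed).shuffle(subjects)
    folds = [subjects[k::n_splits] for k in range(n_splits)]

    for held_out in folds:
        held_out_set = set(held_out)
        test_idx = [i for s in held_out for i in scans_by_subject[s]]
        train_idx = [i for i, s in enumerate(subject_ids) if s not in held_out_set]
        yield train_idx, test_idx


def shared_subjects(subject_ids, train_idx, test_idx):
    """Return the subjects that appear on both sides of a split (leakage check)."""
    return {subject_ids[i] for i in train_idx} & {subject_ids[i] for i in test_idx}


# Toy longitudinal cohort: 6 subjects with 3 scans each (made-up IDs).
subject_ids = [f"sub-{s:02d}" for s in range(6) for _ in range(3)]

# Subject-wise CV: no fold shares a subject between train and test.
for train_idx, test_idx in subject_wise_folds(subject_ids, n_splits=3):
    assert not shared_subjects(subject_ids, train_idx, test_idx)

# Scan-wise split for contrast: hold out the first scan of every subject.
naive_test = list(range(0, len(subject_ids), 3))
naive_train = [i for i in range(len(subject_ids)) if i not in set(naive_test)]
leaked = shared_subjects(subject_ids, naive_train, naive_test)
print(f"subjects present in both train and test: {len(leaked)}")  # prints 6
```

In the scan-wise split every subject contributes scans to both training and test sets, which is the condition under which a model can score well by recognizing subjects instead of diagnostic features. The same check can be applied to a final hold-out set to confirm that it contains only unseen subjects.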