Reproducible Benchmarking for Lung Nodule Detection and Malignancy Classification Across Multiple Low-Dose CT Datasets

Fakrul Islam Tushar, Avivah Wang, Lavsen Dahal, Ehsan Samei, Michael R. Harowicz, Jayashree Kalpathy-Cramer, Kyle J. Lafata, Tina D. Tailor, Cynthia Rudin, Joseph Y. Lo

From arXiv; 3 tables, 2 supplementary tables, 5 figures

Evaluation of artificial intelligence (AI) models for low-dose CT lung cancer screening is limited by heterogeneous datasets, annotation standards, and evaluation protocols, making performance difficult to compare and translate across clinical settings. We establish a public, reproducible multi-dataset benchmark for lung nodule detection and nodule-level cancer classification and quantify cross-dataset generalizability. Using the Duke Lung Cancer Screening (DLCS) dataset as a clinically curated development set, we evaluate performance across LUNA16/LIDC-IDRI, NLST-3D, and LUNA25. Detection models trained on DLCS and LUNA16 were evaluated externally on NLST-3D using free-response ROC analysis. For malignancy classification, we compared five strategies: randomly initialized ResNet50, Models Genesis, Med3D, a Foundation Model for Cancer Biomarkers, and a Strategic Warm-Start (ResNet50-SWS) approach pretrained using detection-derived candidate patches stratified by confidence. Performance was summarized using AUC with 95% confidence intervals and DeLong tests. Detection performance varied substantially by training dataset, with DLCS-trained models outperforming LUNA16-trained models on external NLST-3D evaluation (sensitivity at 2 false positives per scan: 0.72 vs. 0.64; p < 0.001). For malignancy classification, ResNet50-SWS achieved AUCs of 0.71 (DLCS), 0.90 (LUNA16), 0.81 (NLST-3D), and 0.80 (LUNA25), consistently matching or exceeding alternative pretraining strategies. These results demonstrate that dataset characteristics strongly influence lung cancer AI performance and highlight the need for transparent, multi-dataset benchmarking.
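The detection metric quoted above, sensitivity at 2 false positives per scan, is one operating point on a free-response ROC curve. A minimal sketch of how such a point could be computed is given here for illustration only; it is not the authors' evaluation code. The function name `sensitivity_at_fp_rate` and its arguments are hypothetical, and the sketch assumes every candidate has already been matched against the reference annotations and that each true-positive candidate hits a distinct nodule.

```python
import numpy as np


def sensitivity_at_fp_rate(scores, is_true_positive, n_scans, n_nodules,
                           fp_per_scan=2.0):
    """Sensitivity at the loosest threshold keeping FPs per scan <= fp_per_scan.

    Assumptions (not taken from the paper): candidates are pre-matched to
    reference nodules, each true-positive candidate corresponds to a distinct
    nodule, and tied scores are handled in input order.
    """
    scores = np.asarray(scores, dtype=float)
    hits = np.asarray(is_true_positive, dtype=bool)

    # Sweep the confidence threshold from the highest score downwards.
    order = np.argsort(-scores, kind="stable")
    hits = hits[order]

    tp = np.cumsum(hits)    # true positives kept at each threshold
    fp = np.cumsum(~hits)   # false positives kept at each threshold

    # Keep only thresholds whose average false-positive count per scan
    # stays within the requested budget.
    allowed = (fp / n_scans) <= fp_per_scan
    if not allowed.any():
        return 0.0
    return float(tp[allowed].max() / n_nodules)


# Toy example: six candidates from one scan that contains four nodules.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_hits = [True, False, True, False, False, True]
toy_sens = sensitivity_at_fp_rate(toy_scores, toy_hits, n_scans=1, n_nodules=4)
```

In the toy example, two of the four nodules are found before the third false positive appears, so the sensitivity at 2 false positives per scan is 0.5. A full evaluation would also need a rule for matching candidates to nodules, which this sketch leaves out.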
