Evaluation of artificial intelligence (AI) models for low-dose CT lung cancer screening is limited by heterogeneous datasets, annotation standards, and evaluation protocols, making performance difficult to compare and translate across clinical settings. We establish a public, reproducible multi-dataset benchmark for lung nodule detection and nodule-level cancer classification and quantify cross-dataset generalizability. Using the Duke Lung Cancer Screening (DLCS) dataset as a clinically curated development set, we evaluate performance across LUNA16/LIDC-IDRI, NLST-3D, and LUNA25. Detection models trained on DLCS and LUNA16 were evaluated externally on NLST-3D using free-response ROC analysis. For malignancy classification, we compared five strategies: randomly initialized ResNet50, Models Genesis, Med3D, a Foundation Model for Cancer Biomarkers, and a Strategic Warm-Start (ResNet50-SWS) approach pretrained using detection-derived candidate patches stratified by confidence. Performance was summarized using AUC with 95% confidence intervals and DeLong tests. Detection performance varied substantially by training dataset, with DLCS-trained models outperforming LUNA16-trained models on external NLST-3D evaluation (sensitivity at 2 false positives per scan: 0.72 vs. 0.64; p < 0.001). For malignancy classification, ResNet50-SWS achieved AUCs of 0.71 (DLCS), 0.90 (LUNA16), 0.81 (NLST-3D), and 0.80 (LUNA25), consistently matching or exceeding alternative pretraining strategies. These results demonstrate that dataset characteristics strongly influence lung cancer AI performance and highlight the need for transparent, multi-dataset benchmarking.
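The detection operating point quoted in the abstract, sensitivity at 2 false positives per scan, is one point on a free-response ROC curve. The following is a minimal sketch of how such a point could be computed, not the benchmark's own evaluation code. The function name `sensitivity_at_fp_rate` and its inputs are hypothetical. It assumes each detection has already been matched against the reference annotations and flagged as a true or false positive, and that every true-positive detection corresponds to a distinct nodule.

```python
from typing import Sequence


def sensitivity_at_fp_rate(
    scores: Sequence[float],
    is_true_positive: Sequence[bool],
    n_nodules: int,
    n_scans: int,
    fp_per_scan: float = 2.0,
) -> float:
    """Sensitivity at the loosest score threshold whose average number of
    false positives per scan does not exceed `fp_per_scan`.

    Assumption (illustrative only): detections are pooled over all scans and
    each true-positive detection hits a different ground-truth nodule.
    """
    if n_nodules <= 0 or n_scans <= 0:
        raise ValueError("n_nodules and n_scans must be positive")
    if len(scores) != len(is_true_positive):
        raise ValueError("scores and is_true_positive must have equal length")

    # Rank detections from most to least confident.
    ranked = sorted(zip(scores, is_true_positive), key=lambda p: p[0], reverse=True)
    max_false_positives = fp_per_scan * n_scans

    true_pos = 0
    false_pos = 0
    best_sensitivity = 0.0
    i = 0
    while i < len(ranked):
        # Detections with identical scores share a threshold, so admit them together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                true_pos += 1
            else:
                false_pos += 1
            j += 1
        if false_pos > max_false_positives:
            break  # lowering the threshold further exceeds the FP budget
        best_sensitivity = true_pos / n_nodules
        i = j
    return best_sensitivity
```

For example, with one scan containing four annotated nodules and six ranked detections flagged `[True, False, True, False, False, True]`, the function returns 0.5 at a budget of 2 false positives per scan, because the third false positive arrives before the third true positive.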