Purpose. To compare deep learning architectures and classification schemes for dermoscopic images of skin neoplasms and to assess how well they generalize from open international datasets to independent clinical datasets from Russian practice. Methods. Four architectures (ViT-B/16, Swin-S, ConvNeXt-S, EfficientNetV2-S) were compared under three schemes: binary (malignant/benign), single-stage four-class (benign, melanoma [MEL], squamous cell carcinoma [SCC], basal cell carcinoma [BCC]), and a two-stage cascade (binary triage followed by three-class MEL/SCC/BCC differentiation). All models were initialized with ImageNet-pretrained weights, trained with a single augmentation protocol on aggregated open ISIC Archive data, and evaluated on an internal held-out sample and two clinical datasets (Melanoscope AI mobile system; Sechenov University). Results. On the internal sample the binary stage attains ROC-AUC 0.952-0.966; on the Sechenov University dataset ROC-AUC drops to 0.797-0.893 and sensitivity to 0.53-0.67, while expected calibration error (ECE) rises from 0.02 to 0.27-0.39 with underestimation of malignancy, which quantifies the generalization gap in both ranking and calibration. Paired tests confirm one inter-architecture difference on clinical data: ViT-B/16 underperforms at the binary stage (p<0.05); at the differentiation stage no architecture shows a proven advantage. The cascade raises macro F1 over single-stage four-class classification for most architectures, but significantly only for ViT-B/16, because it recovers malignant lesions that the single-stage model assigns to the dominant benign class. On ISIC MILK10k, direct 11-class classification yields a class-averaged sensitivity of 0.525. Conclusion. A tunable triage threshold gives control over sensitivity that standard single-stage (argmax) classification cannot offer, and the cascade better reproduces the logic of clinical differential diagnosis. The persistent generalization gap makes external clinical validation and recalibration necessary before deployment.
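The cascade decision rule and the calibration metric named above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the MEL/SCC/BCC label order, the default threshold of 0.5, and the equal-width 10-bin form of ECE on the malignancy probability are all assumptions made for illustration.

```python
import numpy as np

# Assumed order of the stage-2 outputs; the abstract does not specify it.
MALIGNANT_CLASSES = ("MEL", "SCC", "BCC")


def cascade_predict(p_malignant, p_subtype, threshold=0.5):
    """Two-stage cascade: binary triage, then MEL/SCC/BCC differentiation.

    p_malignant: shape (n,), stage-1 probability that a lesion is malignant.
    p_subtype:   shape (n, 3), stage-2 probabilities over MALIGNANT_CLASSES.
    threshold:   triage cut-off; lowering it sends more lesions to stage 2,
                 raising sensitivity at the cost of specificity.
    """
    p_malignant = np.asarray(p_malignant, dtype=float)
    p_subtype = np.asarray(p_subtype, dtype=float)
    # Stage-2 label for every lesion; used only where triage flags malignancy.
    subtype = np.array(MALIGNANT_CLASSES)[p_subtype.argmax(axis=1)]
    return np.where(p_malignant >= threshold, subtype, "benign")


def expected_calibration_error(p_malignant, y_true, n_bins=10):
    """Binned ECE for a binary classifier (one common variant, assumed here).

    Compares the mean predicted malignancy probability with the observed
    malignancy rate in each equal-width bin, weighted by bin occupancy.
    """
    p = np.asarray(p_malignant, dtype=float)
    y = np.asarray(y_true, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each probability to a bin index in 0..n_bins-1.
    bin_idx = np.digitize(p, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            ece += mask.mean() * abs(y[mask].mean() - p[mask].mean())
    return float(ece)


if __name__ == "__main__":
    # Toy values for illustration only; these are not data from the study.
    p_mal = [0.9, 0.2, 0.6]
    p_sub = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]
    print(cascade_predict(p_mal, p_sub, threshold=0.5))
    print(cascade_predict(p_mal, p_sub, threshold=0.1))
    print(expected_calibration_error([0.15, 0.15, 0.85, 0.85], [0, 0, 1, 1]))
```

In this sketch, lowering `threshold` moves the second toy lesion from the benign class to a malignant subtype, which is the kind of sensitivity control that a single-stage argmax rule does not expose.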