IMPORTANCE: Modern ultrasound systems are universal diagnostic tools capable of imaging the entire body. However, current AI solutions remain fragmented into single-task tools. This critical gap between hardware versatility and software specificity limits workflow integration and clinical utility.

OBJECTIVE: To evaluate the diagnostic accuracy, versatility, and efficiency of single general-purpose deep learning models for multi-organ classification and segmentation.

DESIGN: The Universal UltraSound Image Challenge 2025 (UUSIC25) involved developing algorithms on 11,644 images aggregated from 12 sources (9 public, 3 private). Evaluation used an independent, multi-center private test set of 2,479 images, including data from a center completely unseen during training to assess generalization.

OUTCOMES: Diagnostic performance (Dice Similarity Coefficient [DSC]; Area Under the Receiver Operating Characteristic Curve [AUC]) and computational efficiency (inference time, GPU memory).

RESULTS: Of 15 valid algorithms, the top model (SMART) achieved a macro-averaged DSC of 0.854 across 5 segmentation tasks and AUC of 0.766 for binary classification. Models demonstrated high capability in anatomical segmentation (e.g., fetal head DSC: 0.942) but variability in complex diagnostic tasks subject to domain shift. Specifically, in breast cancer molecular subtyping, the top model's performance dropped from an AUC of 0.571 (internal) to 0.508 (unseen external center), highlighting the challenge of generalization.

CONCLUSIONS: General-purpose AI models can achieve high accuracy and efficiency across multiple tasks using a single architecture. However, significant performance degradation on unseen data suggests domain generalization is critical for future clinical deployment.
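For readers less familiar with the outcome measures, the sketch below shows how a DSC, a macro-average over tasks, and an AUC can be computed. It is a minimal illustration and not the challenge's evaluation code: the function names, the empty-mask convention, and the toy inputs are assumptions made here, and the abstract does not say whether scores were averaged per image before being averaged per task.

```python
from typing import Sequence


def dice(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice Similarity Coefficient for two flattened binary (0/1) masks."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def macro_average(per_task_scores: Sequence[float]) -> float:
    """Unweighted mean over tasks, so every task counts equally."""
    return sum(per_task_scores) / len(per_task_scores)


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a positive case outranks a negative one.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy inputs for illustration only; these are not challenge data.
    task_scores = [
        dice([1, 1, 0, 0], [1, 0, 0, 0]),
        dice([0, 1, 1, 0], [0, 1, 1, 0]),
    ]
    print("macro-averaged DSC:", macro_average(task_scores))
    print("AUC:", auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

On this reading, an AUC of 0.5 means a positive case outranks a negative one only half the time, which is chance-level ranking. That is why the reported drop from 0.571 to 0.508 on the unseen center leaves molecular subtyping close to uninformative.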