Congenital heart disease remains the most common congenital anomaly and a leading cause of neonatal morbidity and mortality. Although first-trimester fetal echocardiography offers an opportunity for earlier detection, automated analysis at this stage is challenging due to small cardiac structures, low signal-to-noise ratio, and substantial inter-operator variability. In this work, we evaluate a self-supervised ultrasound foundation model, USF-MAE, for first-trimester fetal heart view classification. USF-MAE is pretrained with masked autoencoding on more than 370,000 unlabelled ultrasound images spanning over 40 anatomical regions and is subsequently fine-tuned for downstream classification. As a proof of concept, the pretrained Vision Transformer encoder was fine-tuned on an open-source dataset of 6,720 first-trimester fetal echocardiography images to classify five categories: aorta, atrioventricular flows, V sign, X sign, and Other. Model performance was benchmarked against supervised convolutional neural network baselines (ResNet-18 and ResNet-50) and a Vision Transformer (ViT-B/16) pretrained on natural images (ImageNet-1k). All models were trained and evaluated with identical preprocessing, data splits, and optimization protocols. On an independent test set, USF-MAE achieved the highest performance across all evaluation metrics, with 90.57% accuracy, 91.15% precision, 90.57% recall, and a 90.71% F1-score, an absolute improvement of 2.03 percentage points in accuracy and 1.98 points in F1-score over the strongest baseline, ResNet-18. The proposed approach performed robustly without reliance on aggressive image preprocessing or region-of-interest cropping and showed improved discrimination of non-diagnostic frames.
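The sketch below illustrates, under stated assumptions, the kind of downstream setup the abstract describes: a ViT-B/16 encoder with a five-class head, fine-tuned with cross-entropy and evaluated with accuracy, precision, recall, and F1. This is not the authors' released code; the checkpoint path usf_mae_vit_b16.pth, the class names, the data loaders, and the weighted metric averaging are illustrative assumptions.

```python
# Minimal sketch of the downstream fine-tuning/evaluation setup described above,
# NOT the authors' implementation. Assumptions: a torchvision ViT-B/16 backbone,
# an encoder checkpoint at "usf_mae_vit_b16.pth" (hypothetical path), and
# weighted averaging for precision/recall/F1.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Five target view categories from the abstract.
CLASSES = ["aorta", "atrioventricular_flows", "v_sign", "x_sign", "other"]


def build_classifier(encoder_ckpt=None):
    """ViT-B/16 encoder with a 5-way linear classification head."""
    model = vit_b_16(weights=None)
    if encoder_ckpt is not None:
        state = torch.load(encoder_ckpt, map_location="cpu")
        model.load_state_dict(state, strict=False)  # encoder-only weights
    model.heads = nn.Linear(model.hidden_dim, len(CLASSES))
    return model


def finetune(model, loader, epochs=10, lr=1e-4, device="cuda"):
    """Standard supervised fine-tuning with cross-entropy loss."""
    model.to(device).train()
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            opt.zero_grad()
            loss_fn(model(images), labels).backward()
            opt.step()
    return model


@torch.no_grad()
def evaluate(model, loader, device="cuda"):
    """Accuracy, precision, recall, and F1 on a held-out test loader."""
    model.to(device).eval()
    preds, targets = [], []
    for images, labels in loader:
        preds.extend(model(images.to(device)).argmax(dim=1).cpu().tolist())
        targets.extend(labels.tolist())
    prec, rec, f1, _ = precision_recall_fscore_support(
        targets, preds, average="weighted", zero_division=0
    )
    return {"accuracy": accuracy_score(targets, preds),
            "precision": prec, "recall": rec, "f1": f1}
```

Usage would follow the obvious pattern: `model = build_classifier("usf_mae_vit_b16.pth")`, then `finetune(model, train_loader)` and `evaluate(model, test_loader)` on the independent test split. The ResNet-18/ResNet-50 and ImageNet-pretrained ViT-B/16 baselines would slot into the same `finetune`/`evaluate` loop, which is what "identical preprocessing, data splits, and optimization protocols" implies.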