Benchmarking Ultrasound Foundation Models for Fetal Plane Classification

Ultrasound is widely used in obstetric care due to its safety, accessibility, and real-time imaging. However, interpretation remains operator-dependent and susceptible to noise and artifacts. Deep learning models have shown strong performance in addressing these problems, but they typically require large annotated datasets that are difficult to obtain in clinical ultrasound. Foundation models (FMs) offer an alternative: they are pretrained on a large number of ultrasound images to learn transferable representations that can generalize with limited labeled data. This work presents a comprehensive benchmark of ultrasound-specific FMs for fetal plane classification. We evaluated four ultrasound FMs (USFM, MOFO, UltraSAM, FetalCLIP) against two CNN baselines (ResNet50, EfficientNet-V2) and a ViT (DINOv3) pretrained on natural images. We trained all models under two complementary settings: full fine-tuning and linear probing with a frozen encoder. All models were trained using 5-fold patient-level cross-validation on a Spanish fetal ultrasound dataset and tested on both in-domain data and an external African cohort to assess cross-population generalization. FetalCLIP achieved the best results in the linear probing setting (F1 = 0.9261 in-domain, F1 = 0.9731 out-of-domain), while USFM performed best in the full fine-tuning setting (F1 = 0.9476 in-domain, F1 = 0.9515 out-of-domain). MOFO and UltraSAM degraded most in both settings, underperforming natural-image pretrained models in some cases. These findings show that the choice of pretrained model strongly affects fetal plane classification performance, since different pretraining objectives lead to different levels of transferability.
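The patient-level cross-validation mentioned above can be illustrated with a minimal sketch. It is not the authors' code: the function name `patient_level_folds`, the round-robin assignment of patients to folds, and the toy patient IDs are all assumptions made for illustration. The only property it is meant to show is that every image from a given patient lands in the same fold, so no patient contributes to both training and validation.

```python
import random
from collections import defaultdict


def patient_level_folds(patient_ids, n_folds=5, seed=0):
    """Yield (train_idx, val_idx) pairs in which no patient spans both sets.

    patient_ids[i] is the (hypothetical) patient identifier of image i.
    """
    # Group image indices by the patient they belong to.
    by_patient = defaultdict(list)
    for idx, pid in enumerate(patient_ids):
        by_patient[pid].append(idx)

    # Shuffle patients (not images) reproducibly.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    # Assign whole patients to folds in round-robin order.
    fold_of = {pid: i % n_folds for i, pid in enumerate(patients)}

    for k in range(n_folds):
        val_idx = [i for i, pid in enumerate(patient_ids) if fold_of[pid] == k]
        train_idx = [i for i, pid in enumerate(patient_ids) if fold_of[pid] != k]
        yield train_idx, val_idx


# Toy example: 10 images from 6 made-up patients.
patient_ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5", "p5", "p6"]
folds = list(patient_level_folds(patient_ids, n_folds=5))

for train_idx, val_idx in folds:
    train_patients = {patient_ids[i] for i in train_idx}
    val_patients = {patient_ids[i] for i in val_idx}
    # The two patient sets never overlap.
    assert train_patients.isdisjoint(val_patients)
```

Splitting by patient rather than by image matters because images from the same examination tend to be highly correlated; an image-level split would let near-duplicates leak into validation and inflate the reported F1.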
