FUSQA: Fetal Ultrasound Segmentation Quality Assessment

Deep learning models have been effective for various fetal ultrasound segmentation tasks. However, their uncertain generalization to unseen data has raised questions about their readiness for clinical adoption. Normally, a transition to new data requires time-consuming and costly quality assurance to validate segmentation performance after the transition. Segmentation quality assessment efforts have focused on natural images, where the problem has typically been formulated as a Dice score regression task. In this paper, we propose a simplified Fetal Ultrasound Segmentation Quality Assessment (FUSQA) model to assess segmentation quality when no ground-truth masks exist for comparison. We formulate segmentation quality assessment as an automated classification task that distinguishes good from poor-quality segmentation masks, enabling more accurate gestational age estimation. We validate our approach on two datasets that we collected from two hospitals using different ultrasound machines. We compare different architectures; our best-performing architecture achieves over 90% classification accuracy in distinguishing good from poor-quality segmentation masks on an unseen dataset. Additionally, the gestational age estimated from crown-rump length (CRL) measurements on well-segmented masks differed from the gestational age reported by doctors by only 1.45 days. In contrast, this difference reached up to 7.73 days when CRL was calculated from poorly segmented masks. These results suggest that AI-based approaches can aid fetal ultrasound segmentation quality assessment and might detect poor segmentation in real-time screening in the future.
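
To make the contrast between the two formulations concrete, the following minimal sketch computes the Dice score between a predicted mask and a reference mask, then collapses it into a good/poor label. It is an illustration under stated assumptions, not the FUSQA implementation: the function names `dice_score` and `quality_label` and the 0.8 threshold are hypothetical, and the abstract does not say how the good and poor labels were actually assigned. Note that a reference mask is needed only in this sketch; the classifier described above is meant to work when none is available.

```python
def dice_score(pred, ref):
    """Dice = 2|A and B| / (|A| + |B|) for two binary masks given as nested lists."""
    p = [bool(v) for row in pred for v in row]
    r = [bool(v) for row in ref for v in row]
    if len(p) != len(r):
        raise ValueError("masks must contain the same number of pixels")
    intersection = sum(a and b for a, b in zip(p, r))
    total = sum(p) + sum(r)
    # Two empty masks agree perfectly; this convention also avoids dividing by zero.
    return 1.0 if total == 0 else 2.0 * intersection / total


def quality_label(dice, threshold=0.8):
    """Collapse a Dice score into a binary label.

    The 0.8 threshold is a hypothetical value chosen for illustration only.
    """
    return "good" if dice >= threshold else "poor"


if __name__ == "__main__":
    predicted = [[1, 1], [0, 0]]
    reference = [[1, 0], [0, 0]]
    score = dice_score(predicted, reference)
    print(round(score, 3), quality_label(score))
```

A regression model would be trained to predict the continuous score itself, whereas a classification model only needs to predict the binary label, which is the simplification the abstract describes.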
