Towards Reliable Fetal Ultrasound Interpretation with Multi-Agent Collaboration

Xiaotian Hu, Mingxuan Liu, Junwei Huang, Kasidit Anmahapong, Yifei Chen, Yiming Huang, Xuguang Bai, Zihan Li, Hongjia Yang, Yingqi Hao, Hong Xu, Yu Jiang, Tian Tian, Yi Liao, Haibo Qu, Qiyuan Tian

Automated fetal ultrasound interpretation requires a workflow from visual perception, including plane recognition and anatomical segmentation, to clinical understanding, including biometric measurement and diagnostic reporting. However, the prevailing "one-task, one-model" paradigm limits systematic integration of evidence across this multi-step process. Although multimodal large language models (MLLMs) show promising visual understanding, their limited domain-specific grounding and hallucination risks restrict reliability in fetal ultrasound analysis. To address these limitations, we propose FetUSAgents, a tool-augmented multi-agent system for comprehensive fetal ultrasound interpretation, supporting visual question answering (VQA), report generation, image captioning, and video summarization. FetUSAgents coordinates task-specific visual tools through collaborative LLM agents and decomposes clinical queries into subtasks that progress from anatomical recognition to quantitative measurement. We further introduce Dual-Path Evidence Arbitration (DPEA), which integrates LLM-based deliberative reasoning with structured computational evidence from specialized visual tools. A retrieval-enhanced evidence bank consolidates intermediate findings to support traceable and clinically grounded conclusions. In addition, we construct FetUS-VQA, a dedicated VQA benchmark for fetal ultrasound, comprising 1,892 images and 3,205 question-answer pairs across 10 clinical tasks. Extensive out-of-distribution experiments show that FetUSAgents outperforms general and medical MLLMs, exceeding the strongest baseline by more than 25 percent in VQA accuracy. These results suggest a scalable route toward evidence-driven clinical assistants for prenatal imaging. Code is available.
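
To make the idea of arbitrating between two evidence paths concrete, the following is a minimal sketch only. The abstract does not specify how DPEA or the evidence bank work internally, so every name here (`Evidence`, `EvidenceBank`, `arbitrate`, `tool_threshold`) and the arbitration rule itself are hypothetical illustrations, not the authors' implementation. The toy rule prefers the most confident tool-derived finding when it clears a threshold, falls back to the LLM answer otherwise, and records whether the two paths agree.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    source: str        # "tool" or "llm" (hypothetical labels)
    claim: str         # the candidate answer from this path
    confidence: float  # score in [0, 1]


@dataclass
class EvidenceBank:
    """Hypothetical store for intermediate findings."""
    items: list = field(default_factory=list)

    def add(self, ev):
        self.items.append(ev)

    def by_source(self, source):
        return [ev for ev in self.items if ev.source == source]


def arbitrate(bank, tool_threshold=0.7):
    """Toy arbitration rule (an assumption, not the paper's DPEA)."""
    best_tool = max(bank.by_source("tool"),
                    key=lambda ev: ev.confidence, default=None)
    best_llm = max(bank.by_source("llm"),
                   key=lambda ev: ev.confidence, default=None)
    if best_tool is None and best_llm is None:
        return None  # no evidence collected
    # Prefer tool evidence when it is confident enough or is the only path.
    if best_tool is not None and (
        best_llm is None or best_tool.confidence >= tool_threshold
    ):
        chosen = best_tool
    else:
        chosen = best_llm
    agree = (
        best_tool is not None
        and best_llm is not None
        and best_tool.claim == best_llm.claim
    )
    return {"answer": chosen.claim, "source": chosen.source,
            "paths_agree": agree}


# Example with made-up values: the tool path is confident, so it wins.
bank = EvidenceBank()
bank.add(Evidence("tool", "femur", 0.92))
bank.add(Evidence("llm", "abdomen", 0.60))
result = arbitrate(bank)
```

Keeping every finding in the bank, rather than only the winning answer, is what would allow a conclusion to be traced back to the evidence that supported it.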
