Ultrasound is a widely used imaging modality critical to global healthcare, yet its interpretation remains challenging because image quality varies with the operator, noise, and anatomical structure. Although large vision-language models (LVLMs) have demonstrated impressive multimodal capabilities across natural and medical domains, their performance on ultrasound remains largely unexplored. We introduce U2-BENCH, the first comprehensive benchmark for evaluating LVLMs on ultrasound understanding across classification, detection, regression, and text generation tasks. U2-BENCH aggregates 7,241 cases spanning 15 anatomical regions and defines 8 clinically inspired tasks, such as diagnosis, view recognition, lesion localization, clinical value estimation, and report generation, across 50 ultrasound application scenarios. We evaluate 23 state-of-the-art LVLMs, covering open- and closed-source as well as general-purpose and medical-specific models. Our results reveal strong performance on image-level classification but persistent challenges in spatial reasoning and clinical language generation. U2-BENCH establishes a rigorous and unified testbed to assess and accelerate LVLM research in the uniquely multimodal domain of medical ultrasound imaging.