Vision-Language Models (VLMs) can generate convincing clinical narratives, yet they frequently struggle to visually ground their statements. We posit that this limitation arises from the scarcity of high-quality, large-scale clinical referring-localization pairs. To address it, we introduce MedGround, an automated pipeline that transforms segmentation resources into high-quality medical referring-grounding data. Leveraging expert masks as spatial anchors, MedGround precisely derives localization targets, extracts shape and spatial cues, and guides VLMs to synthesize natural, clinically grounded queries that reflect morphology and location. To ensure data rigor, a multi-stage verification system integrates strict formatting checks, geometry- and medical-prior rules, and image-based visual judging to filter out ambiguous or visually unsupported samples. The resulting dataset, MedGround-35K, is a multimodal medical referring-grounding corpus produced by this pipeline. Extensive experiments demonstrate that VLMs trained with MedGround-35K consistently improve referring-grounding performance, strengthen multi-object semantic disambiguation, and generalize well to unseen grounding settings. This work highlights MedGround as a scalable, data-driven approach to anchoring medical language to verifiable visual evidence. Dataset and code will be released publicly upon acceptance.
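To make the mask-anchoring step concrete, the sketch below shows one way a localization target and coarse shape/spatial cues could be derived from an expert segmentation mask. This is a minimal illustration under our own assumptions (binary 2D NumPy masks, axis-aligned bounding boxes, quadrant-level location words); the function name and cue set are hypothetical and do not reproduce the MedGround implementation.

```python
# Minimal sketch (not the authors' code): deriving a localization target and
# coarse shape/spatial cues from a binary segmentation mask, assuming the mask
# is a 2D NumPy array whose nonzero pixels mark the annotated structure.
import numpy as np


def mask_to_grounding_cues(mask: np.ndarray) -> dict:
    """Return a bounding box plus simple shape and location descriptors."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("Empty mask: no structure to ground.")

    # Tight axis-aligned bounding box [x_min, y_min, x_max, y_max].
    bbox = [int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())]

    # Shape cues: relative area and elongation of the region.
    h, w = mask.shape
    area_ratio = ys.size / float(h * w)
    box_w = bbox[2] - bbox[0] + 1
    box_h = bbox[3] - bbox[1] + 1
    elongation = max(box_w, box_h) / max(1, min(box_w, box_h))

    # Spatial cue: coarse image quadrant of the region centroid,
    # usable as a location phrase in a synthesized referring query.
    cy, cx = ys.mean(), xs.mean()
    vert = "upper" if cy < h / 2 else "lower"
    horiz = "left" if cx < w / 2 else "right"

    return {
        "bbox": bbox,
        "area_ratio": round(area_ratio, 4),
        "elongation": round(elongation, 2),
        "location": f"{vert} {horiz}",
    }


if __name__ == "__main__":
    toy = np.zeros((256, 256), dtype=np.uint8)
    toy[40:90, 160:220] = 1  # synthetic lesion-like region for demonstration
    print(mask_to_grounding_cues(toy))
```

In such a setup, the returned bounding box would serve as the grounding target, while the shape and location fields could be passed to a VLM prompt so that the synthesized query mentions morphology and position; the subsequent verification stages would then check that the generated query remains consistent with these geometric cues.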