Localized image captioning has made significant progress with models like the Describe Anything Model (DAM), which can generate detailed region-specific descriptions without explicit region-text supervision. However, such capabilities have yet to be widely applied to specialized domains like medical imaging, where diagnostic interpretation relies on subtle regional findings rather than global understanding. To bridge this gap, we propose MedDAM, the first comprehensive framework leveraging large vision-language models for region-specific captioning in medical images. MedDAM employs medical expert-designed prompts tailored to specific imaging modalities and establishes a robust evaluation benchmark comprising a customized assessment protocol, a data pre-processing pipeline, and a specialized QA template library. This benchmark evaluates MedDAM alongside other adaptable large vision-language models, focusing on clinical factuality through attribute-level verification tasks and thereby circumventing the absence of ground-truth region-caption pairs in medical datasets. Extensive experiments on the VinDr-CXR, LIDC-IDRI, and SkinCon datasets demonstrate MedDAM's superiority over leading peers (including GPT-4o, Claude 3.7 Sonnet, LLaMA-3.2 Vision, Qwen2.5-VL, GPT4RoI, and OMG-LLaVA) on this task, revealing the importance of region-level semantic alignment in medical image understanding and establishing MedDAM as a promising foundation for clinical vision-language integration.