Most existing CLIP-style medical vision--language pretraining methods rely on global or local alignment trained with substantial paired data. However, global alignment is easily dominated by non-diagnostic information, while local alignment fails to integrate key diagnostic evidence. Learning reliable diagnostic representations therefore becomes difficult, which limits the applicability of these methods in medical scenarios with limited paired data. To address this issue, we propose an LLM-Guided Diagnostic Evidence Alignment method (LGDEA), which shifts the pretraining objective toward evidence-level alignment that is more consistent with the medical diagnostic process. Specifically, we leverage LLMs to extract key diagnostic evidence from radiology reports and construct a shared diagnostic evidence space. This enables evidence-aware cross-modal alignment and allows LGDEA to effectively exploit abundant unpaired medical images and reports, substantially alleviating the reliance on paired data. Extensive experimental results demonstrate that our method achieves consistent and significant improvements on phrase grounding, image--text retrieval, and zero-shot classification, and even rivals pretraining methods that rely on substantial paired data.
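To make the evidence-level alignment objective concrete, the following is a minimal PyTorch sketch of contrastive alignment between image embeddings and embeddings of LLM-extracted diagnostic evidence. It assumes evidence phrases have already been extracted and encoded into the shared evidence space; the function name `evidence_alignment_loss` and the symmetric InfoNCE formulation are illustrative assumptions, not the paper's exact objective.

```python
# A minimal, self-contained sketch of evidence-level contrastive alignment.
# Assumption (not from the paper): each image is paired with a pooled
# embedding of its LLM-extracted diagnostic evidence, and alignment is a
# symmetric InfoNCE loss over image--evidence pairs in the shared space.
import torch
import torch.nn.functional as F

def evidence_alignment_loss(img_emb, evid_emb, temperature=0.07):
    """Symmetric InfoNCE over N matched (image, evidence) embedding pairs.

    img_emb:  (N, D) image embeddings
    evid_emb: (N, D) pooled embeddings of LLM-extracted diagnostic evidence
    """
    img_emb = F.normalize(img_emb, dim=-1)
    evid_emb = F.normalize(evid_emb, dim=-1)
    logits = img_emb @ evid_emb.t() / temperature      # (N, N) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_i2e = F.cross_entropy(logits, targets)        # image -> evidence
    loss_e2i = F.cross_entropy(logits.t(), targets)    # evidence -> image
    return 0.5 * (loss_i2e + loss_e2i)

# Toy usage: 4 pairs of 128-dim embeddings.
img = torch.randn(4, 128)
evid = torch.randn(4, 128)
print(evidence_alignment_loss(img, evid))
```

Because the positives here are evidence embeddings rather than whole-report embeddings, such a loss would align images with diagnostic content while down-weighting non-diagnostic text, which is the behavior the abstract attributes to evidence-level alignment.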