Contrastive Language-Image Pre-training (CLIP) shows promise in medical image analysis but requires substantial data and computational resources. Because of these constraints, existing CLIP applications in medical imaging focus mainly on modalities such as chest X-rays, for which abundant image-report pairs are available, leaving many other important modalities under-explored. Here, we propose the first adaptation of the full CLIP model to mammography, which presents significant challenges due to labeled data scarcity, high-resolution images with small regions of interest, and class imbalance. We first develop a specialized supervision framework for mammography that leverages its multi-view nature. Furthermore, we design a symmetric local alignment module to better focus on detailed features in high-resolution images. Lastly, we incorporate a parameter-efficient fine-tuning approach for large language models pre-trained with medical knowledge to address data limitations. Our multi-view and multi-scale alignment (MaMA) method outperforms state-of-the-art baselines on three different tasks across two large real-world mammography datasets, EMBED and RSNA-Mammo, with only 52% of the model size of the largest baseline.