Contrastive Language-Image Pre-training (CLIP) demonstrates strong potential in medical image analysis but requires substantial data and computational resources. Due to these restrictions, existing CLIP applications in medical imaging focus mainly on modalities like chest X-rays that have abundant image-report data available, leaving many other important modalities underexplored. Here, we propose one of the first adaptations of the full CLIP model to mammography, which presents significant challenges due to labeled data scarcity, high-resolution images with small regions of interest, and class-wise imbalance. We first develop a specialized supervision framework for mammography that leverages its multi-view nature. Furthermore, we design a symmetric local alignment module to better focus on detailed features in high-resolution images. Lastly, we incorporate a parameter-efficient fine-tuning approach for large language models pre-trained with medical knowledge to address data limitations. Our multi-view and multi-scale alignment (MaMA) method outperforms state-of-the-art baselines on three different tasks across two large real-world mammography datasets, EMBED and RSNA-Mammo, while using only 52% of the model size of the largest baseline. The code is available at https://github.com/XYPB/MaMA.