Learning medical visual representations directly from paired images and reports through multimodal self-supervised learning has emerged as a novel and efficient approach to digital diagnosis in recent years. However, existing models suffer from several severe limitations: 1) they neglect the selection of negative samples, resulting in a scarcity of hard negatives and the inclusion of false negatives; 2) they focus on global feature extraction, overlooking the fine-grained local details that are crucial for medical image recognition tasks; and 3) their contrastive learning primarily targets high-level features while ignoring the low-level details essential for accurate medical analysis. Motivated by these critical issues, this paper presents a Cross-Modal Cluster-Guided Negative Sampling (CM-CGNS) method built on two key ideas. First, it extends the k-means clustering used on local text features in the single-modal domain to the multimodal domain through cross-modal attention. This improvement increases the number of negative samples and boosts the model's representation capability. Second, it introduces a Cross-Modal Masked Image Reconstruction (CM-MIR) module that leverages local text-to-image features obtained via cross-modal attention to reconstruct masked local image regions. This module significantly strengthens the model's cross-modal information interaction and retains the low-level image features essential for downstream tasks. By addressing the aforementioned limitations, the proposed CM-CGNS learns effective and robust medical visual representations suitable for various recognition tasks. Extensive experiments on classification, detection, and segmentation tasks across five downstream datasets show that our method outperforms state-of-the-art approaches on multiple metrics, verifying its superior performance.
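To make the first idea concrete, below is a minimal sketch of cluster-guided negative selection, assuming PyTorch and scikit-learn; the function names, mean pooling, and single-head dot-product attention are illustrative assumptions, not the authors' implementation. Local text features attend over image patches, the fused cross-modal features are clustered with k-means, and same-cluster pairs are masked out of the InfoNCE denominator as likely false negatives.

```python
# A sketch under stated assumptions: names and pooling choices are hypothetical.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def cross_modal_attend(txt, img):
    """txt: (B, Lt, D) local text features; img: (B, Li, D) local image features.
    Returns text-conditioned image features of shape (B, Lt, D)."""
    attn = torch.softmax(txt @ img.transpose(1, 2) / txt.size(-1) ** 0.5, dim=-1)
    return attn @ img  # each text token aggregates the image patches it attends to

def cluster_guided_infonce(txt, img, n_clusters=8, tau=0.07):
    # Sample-level embeddings for the contrastive pair (mean-pooled, normalized).
    z_txt = F.normalize(txt.mean(dim=1), dim=-1)  # (B, D)
    z_img = F.normalize(img.mean(dim=1), dim=-1)  # (B, D)
    # Cluster the cross-modal (text-attended image) features: samples that land
    # in the same cluster are treated as probable false negatives of each other.
    fused = F.normalize(cross_modal_attend(txt, img).mean(dim=1), dim=-1)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
        fused.detach().cpu().numpy())
    labels = torch.as_tensor(labels, device=img.device)
    sim = z_txt @ z_img.t() / tau  # (B, B) text-to-image similarity logits
    pos = torch.eye(len(sim), dtype=torch.bool, device=sim.device)
    false_neg = (labels[:, None] == labels[None, :]) & ~pos  # same cluster, not self
    sim = sim.masked_fill(false_neg, float("-inf"))  # drop likely false negatives
    target = torch.arange(len(sim), device=sim.device)
    return F.cross_entropy(sim, target)
```

Masking same-cluster pairs removes probable false negatives from the denominator, while the remaining cross-cluster pairs supply harder negatives than uniform in-batch sampling.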
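Similarly, a minimal sketch of the CM-MIR idea under the same assumptions: the single cross-attention layer and linear pixel decoder below are hypothetical stand-ins for whatever heads the full model uses. Masked image patches query the report text via cross-modal attention, and the text-informed features are decoded back to pixels with a loss computed only on the masked regions.

```python
# A sketch under stated assumptions: module names and the decoder are hypothetical.
import torch
import torch.nn as nn

class MaskedCrossModalReconstruction(nn.Module):
    def __init__(self, dim=256, patch_pixels=16 * 16, mask_ratio=0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.decoder = nn.Linear(dim, patch_pixels)  # patch features -> pixels

    def forward(self, img_tokens, txt_tokens, patch_pixels_gt):
        """img_tokens: (B, N, D) patch features; txt_tokens: (B, L, D) text
        features; patch_pixels_gt: (B, N, P) ground-truth patch pixels."""
        B, N, D = img_tokens.shape
        mask = torch.rand(B, N, device=img_tokens.device) < self.mask_ratio
        x = torch.where(mask[..., None], self.mask_token.expand(B, N, D), img_tokens)
        # Masked patches query the report: text-to-image information flows back
        # into the image stream before decoding.
        fused, _ = self.cross_attn(query=x, key=txt_tokens, value=txt_tokens)
        recon = self.decoder(x + fused)                       # (B, N, P)
        loss = ((recon - patch_pixels_gt) ** 2)[mask].mean()  # masked patches only
        return loss
```

Supervising reconstruction only on masked patches forces the text-to-image pathway to carry low-level appearance information, which is what the module aims to preserve for dense downstream tasks such as detection and segmentation.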