Existing contrastive language-image pre-training aims to learn a joint representation by matching abundant image-text pairs. However, the number of image-text pairs in medical datasets is usually orders of magnitude smaller than in natural-image datasets. Moreover, medical image-text pairs often involve numerous complex fine-grained correspondences. This paper aims to enhance data efficiency by introducing multiple-to-multiple local relationship modeling to capture denser supervision. More specifically, we propose a Medical Language-Image Pre-training (MLIP) framework that exploits limited medical image-text data more efficiently through patch-sentence matching. Furthermore, we introduce a masked contrastive learning strategy with semantic integrity estimation to reduce redundancy in images while preserving the underlying semantics. Our evaluation results show that MLIP outperforms previous work in zero/few-shot classification and few-shot segmentation tasks by a large margin.
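To make the patch-sentence matching idea concrete, below is a minimal PyTorch-style sketch of a local contrastive objective between image-patch and report-sentence embeddings. It is only an illustrative sketch under stated assumptions, not the paper's actual implementation: the function name `patch_sentence_contrastive_loss`, the tensor shapes, and the single temperature `tau` are hypothetical choices.

```python
# Illustrative sketch of a patch-sentence contrastive objective (not the authors' code).
# Assumes patch embeddings from an image encoder and sentence embeddings from a text
# encoder, both already projected to a shared dimension D.
import torch
import torch.nn.functional as F

def patch_sentence_contrastive_loss(patch_emb, sent_emb, tau=0.07):
    """patch_emb: (B, P, D) image-patch embeddings; sent_emb: (B, S, D) sentence embeddings."""
    patch_emb = F.normalize(patch_emb, dim=-1)
    sent_emb = F.normalize(sent_emb, dim=-1)

    # Cosine similarity between every sentence of report j and every patch of image i:
    # sim[i, j, s, p] = <sent_emb[j, s], patch_emb[i, p]>
    sim = torch.einsum('jsd,ipd->ijsp', sent_emb, patch_emb)

    # For each (image i, report j, sentence s), attention-pool the patches of image i
    # into a sentence-conditioned context, then score the context against the sentence.
    attn = F.softmax(sim / tau, dim=-1)                     # attention over patches
    ctx = torch.einsum('ijsp,ipd->ijsd', attn, patch_emb)   # (B, B, S, D)
    sent_scores = (ctx * sent_emb.unsqueeze(0)).sum(-1)     # (B, B, S)
    logits = sent_scores.mean(-1) / tau                     # (B, B) image-report scores

    # Symmetric InfoNCE: each image should match its paired report, and vice versa.
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```

Compared with a global image-report contrastive loss, scoring every sentence against attention-pooled patches yields many local alignment signals per pair, which is the denser supervision the abstract refers to.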