Contrastive-learning-based vision-language joint pre-training has emerged as a successful representation learning strategy. In this paper, we present a prototype representation learning framework that incorporates both global and local alignment between medical images and reports. In contrast to standard global multi-modality alignment methods, we employ a local alignment module for fine-grained representation. Furthermore, a cross-modality conditional reconstruction module is designed to exchange information across modalities during training by reconstructing masked images and reports. To reconstruct long reports, a sentence-wise prototype memory bank is constructed, enabling the network to focus on low-level localized visual features and high-level clinical linguistic features. Additionally, a non-auto-regressive generation paradigm is proposed for reconstructing non-sequential reports. Experimental results on five downstream tasks, including supervised classification, zero-shot classification, image-to-text retrieval, semantic segmentation, and object detection, show that the proposed method outperforms other state-of-the-art methods across multiple datasets and under different dataset size settings. The code is available at https://github.com/QtacierP/PRIOR.
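As an illustration of the global image-report alignment mentioned in the abstract, the following is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss of the kind commonly used in vision-language pre-training. It is not taken from the PRIOR repository: the function names, the temperature value of 0.07, and the use of plain Python lists as embeddings are all assumptions made for illustration, and the paper's actual loss may differ.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def global_contrastive_loss(image_embs, report_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    image_embs[i] and report_embs[i] are assumed to come from the same
    image-report pair; every other combination is treated as a negative.
    The temperature is a hypothetical default, not a value from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    reps = [l2_normalize(v) for v in report_embs]
    # Cosine similarity matrix (images x reports), scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(img, rep)) / temperature for rep in reps]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # reports x images
    # Average the image-to-report and report-to-image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))
```

Under these assumptions, a batch whose image and report embeddings are correctly paired yields a loss close to zero, while the same embeddings with the pairing shuffled yield a much larger loss. The local alignment and reconstruction modules described in the abstract are not covered by this sketch.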