Contrastive-learning-based vision-language joint pre-training has emerged as a successful representation learning strategy. In this paper, we present a prototype representation learning framework that incorporates both global and local alignment between medical images and reports. In contrast to standard global multi-modality alignment methods, we employ a local alignment module for fine-grained representation. Furthermore, a cross-modality conditional reconstruction module is designed to exchange information across modalities during training by reconstructing masked images and reports. For reconstructing long reports, a sentence-wise prototype memory bank is constructed, enabling the network to focus on low-level localized visual features and high-level clinical linguistic features. Additionally, a non-auto-regressive generation paradigm is proposed for reconstructing non-sequential reports. Experimental results on five downstream tasks, including supervised classification, zero-shot classification, image-to-text retrieval, semantic segmentation, and object detection, show that the proposed method outperforms other state-of-the-art methods across multiple datasets and under different dataset size settings. The code is available at https://github.com/QtacierP/PRIOR.
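To make the global alignment idea concrete, the following is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss between image and report embeddings, which is a common formulation for this kind of pre-training. It is not taken from the PRIOR code base: the function names, the temperature value, and the use of NumPy are assumptions made purely for illustration, and the local alignment, reconstruction, and prototype memory bank components are not covered.

```python
import numpy as np


def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def global_contrastive_loss(image_emb, report_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    Hypothetical helper: row i of image_emb is assumed to be paired with
    row i of report_emb. The temperature is an assumed default.
    """
    # L2-normalize so that dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    report_emb = report_emb / np.linalg.norm(report_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix: entry (i, j) compares image i with report j.
    logits = image_emb @ report_emb.T / temperature
    idx = np.arange(logits.shape[0])

    # Image-to-report: each row should score highest on its own column.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Report-to-image: each column should score highest on its own row.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

In this sketch, a batch whose image and report embeddings are correctly paired yields a lower loss than the same batch with the pairing shuffled, which is the behavior a global alignment objective is meant to reward.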