ECAMP: Entity-centered Context-aware Medical Vision Language Pre-training

Despite significant advances in medical vision-language pre-training, existing methods have largely overlooked the entity-specific context inherent in radiology reports and the complex cross-modality contextual relationships between text and images. To close this gap, we propose a novel Entity-centered Context-aware Medical Vision-language Pre-training (ECAMP) framework, designed to enable a more entity-centered and context-sensitive interpretation of medical data. Using a recent, powerful large language model, we distill entity-centered context from medical reports, which gives ECAMP more effective supervision from the text modality. We further pre-train the model with carefully designed entity-aware, context-enhanced masked language modeling and context-guided super-resolution tasks, which refine the interplay between the text and image modalities and strengthen the model's ability to extract entity-centered contextual features. In addition, our proposed multi-scale context fusion design improves the semantic integration of coarse- and fine-level image representations, yielding better performance on multi-scale downstream applications. Together, these components deliver significant performance gains over current state-of-the-art methods and establish a new standard for cross-modality learning in medical imaging. Extensive experiments on classification, segmentation, and detection tasks across several public datasets demonstrate the effectiveness of ECAMP. Code and models are available at https://github.com/ToniChopp/ECAMP.
