The recent surge of foundation models in computer vision and natural language processing opens up prospects for using multi-modal clinical data to train large models with strong generalizability. Yet pathological image datasets often lack biomedical text annotation and enrichment. Using biomedical text knowledge to guide data-efficient image diagnosis has therefore become a topic of substantial interest. In this paper, we propose to Connect Image and Text Embeddings (CITE) to enhance pathological image classification. CITE injects text insights gained from language models pre-trained on a broad range of biomedical texts, thereby adapting foundation models towards pathological image understanding. Through extensive experiments on the PatchGastric stomach tumor pathological image dataset, we demonstrate that CITE achieves leading performance compared with various baselines, especially when training data is scarce. CITE offers insights into leveraging in-domain text knowledge to reinforce data-efficient pathological image classification. Code is available at https://github.com/Yunkun-Zhang/CITE.