Medical vision-language models enable co-learning and integration of features from medical imaging and clinical text. However, these models are difficult to train, and their latent representation space can be complex. Here we propose a novel method for pre-training and regularising medical vision-language models. The proposed method, named Medical vision-language pre-training with Frozen language models and Latent spAce Geometry optimization (M-FLAG), leverages a frozen language model for training stability and efficiency and introduces a novel orthogonality loss to harmonize the latent space geometry. We demonstrate the potential of the pre-trained model on three downstream tasks: medical image classification, segmentation, and object detection. Extensive experiments across five public datasets demonstrate that M-FLAG significantly outperforms existing medical vision-language pre-training approaches while reducing the number of parameters by 78\%. Notably, M-FLAG achieves outstanding performance on the segmentation task using only 1\% of the RSNA dataset, even outperforming ImageNet pre-trained models fine-tuned on 100\% of the data.
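To make the idea of an orthogonality loss on the latent space more concrete, the following is a minimal sketch and not the formulation used by M-FLAG, which the abstract does not spell out. It assumes a batch of embeddings of shape (batch, dim) and penalises the deviation of the Gram matrix of unit-normalised latent dimensions from the identity. The function name `orthogonality_loss` and the parameter `eps` are hypothetical, chosen for illustration only.

```python
import numpy as np


def orthogonality_loss(z: np.ndarray, eps: float = 1e-8) -> float:
    """Illustrative orthogonality penalty (an assumption, not the paper's loss).

    z has shape (batch, dim). Each column (one latent dimension across the
    batch) is scaled to unit L2 norm, so the Gram matrix of the columns has
    ones on its diagonal and cosine similarities between latent dimensions
    off the diagonal. The penalty is the squared Frobenius distance between
    that Gram matrix and the identity.
    """
    z = np.asarray(z, dtype=np.float64)
    col_norms = np.linalg.norm(z, axis=0, keepdims=True)  # shape (1, dim)
    z_unit = z / (col_norms + eps)                        # unit-norm columns
    gram = z_unit.T @ z_unit                              # shape (dim, dim)
    identity = np.eye(z.shape[1])
    return float(np.sum((gram - identity) ** 2))


if __name__ == "__main__":
    # Latent dimensions that are mutually orthogonal across the batch.
    orthogonal = np.eye(4)[:, :3]
    # Fully collapsed latent dimensions: every column is identical.
    collapsed = np.ones((4, 3))
    print("orthogonal:", orthogonality_loss(orthogonal))
    print("collapsed: ", orthogonality_loss(collapsed))
```

Under these assumptions the penalty is close to zero when the latent dimensions are mutually orthogonal and grows as they become correlated, which is one common way to discourage a collapsed or redundant latent space. In a training setup of the kind the abstract describes, such a term would be computed with a differentiable framework on the trainable image encoder's output while the language model's weights stay fixed.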