Medical vision-language models jointly learn and integrate features from medical imaging and clinical text. However, these models are difficult to train, and their latent representation space can be complex. Here we propose a novel method for pre-training and regularizing medical vision-language models. The proposed method, named Medical vision-language pre-training with Frozen language models and Latent spAce Geometry optimization (M-FLAG), leverages a frozen language model for training stability and efficiency and introduces a novel orthogonality loss to harmonize the latent space geometry. We demonstrate the potential of the pre-trained model on three downstream tasks: medical image classification, segmentation, and object detection. Extensive experiments across five public datasets demonstrate that M-FLAG significantly outperforms existing medical vision-language pre-training approaches while reducing the number of parameters by 78\%. Notably, M-FLAG achieves outstanding segmentation performance using only 1\% of the RSNA dataset, even outperforming ImageNet pre-trained models fine-tuned on 100\% of the data.
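The abstract does not give the form of the orthogonality loss, so the following is only a minimal sketch of one common way to penalize non-orthogonal latent dimensions, not the M-FLAG formulation itself. It assumes a batch of latent embeddings `z` of shape `(batch, dim)`, normalizes each latent dimension across the batch, and measures how far the resulting Gram matrix is from the identity. The function name `orthogonality_penalty` and the `eps` parameter are hypothetical and chosen for illustration.

```python
import numpy as np


def orthogonality_penalty(z: np.ndarray, eps: float = 1e-8) -> float:
    """Illustrative orthogonality penalty (an assumption, not the paper's loss).

    z: array of shape (batch, dim) holding latent embeddings.
    Returns the squared Frobenius distance between the Gram matrix of the
    column-normalized embeddings and the identity matrix.
    """
    # Scale every latent dimension (column) to unit L2 norm across the batch.
    z = z / (np.linalg.norm(z, axis=0, keepdims=True) + eps)
    # Gram matrix of latent dimensions, shape (dim, dim).
    gram = z.T @ z
    # Off-diagonal entries measure correlation between different dimensions.
    identity = np.eye(gram.shape[0])
    return float(np.sum((gram - identity) ** 2))


# Mutually orthogonal latent dimensions: penalty is close to zero.
orthogonal = np.eye(4)[:, :3]
# Two identical latent dimensions: penalty is clearly positive.
collapsed = np.array([[1.0, 1.0], [0.0, 0.0]])

low = orthogonality_penalty(orthogonal)
high = orthogonality_penalty(collapsed)
```

In a pre-training setup of the kind described above, a term like this would typically be added to an image-text alignment objective, with the text embeddings from the frozen language model treated as fixed targets. How the two terms are weighted is not stated in the abstract.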