Medical images are often harder to acquire than natural images because the equipment and technology are specialized, so medical image datasets are fewer and training a strong pretrained medical vision model is difficult. How best to adapt a vision model pretrained on natural images to the medical domain remains an open question. For image classification, a popular method is linear probing (LP). However, LP operates only on the output of the feature extractor, while a gap remains between the input medical images and the natural-image pretrained model. We introduce visual prompting (VP) to bridge this gap and analyze strategies for coupling LP and VP. We design a joint learning loss function that combines a categorisation loss with a discrepancy loss, which describes the variance between prompted and plain images, and we name this joint training strategy MoVL (Mixture of Visual Prompting and Linear Probe). We experiment on four medical image classification datasets with two mainstream architectures, ResNet and CLIP. Results show that, without changing the parameters or architecture of the backbone model and with fewer parameters, MoVL has the potential to match full finetuning (FF) accuracy (averaged over the four medical datasets, 90.91% for MoVL versus 91.13% for FF). On an out-of-distribution medical dataset, our method (90.33%) outperforms FF (85.15%) by an absolute 5.18 percentage points.
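To make the joint objective concrete, the following is a minimal sketch of a loss that adds a discrepancy term to a classification term. The abstract does not give the exact form of either term or how they are weighted, so everything specific here is an assumption: softmax cross-entropy for the categorisation loss, mean squared pixel difference for the discrepancy loss, a discrepancy that is penalized rather than encouraged, and a hypothetical weight `lam`. The function names are illustrative and are not taken from the MoVL implementation.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, computed stably via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def discrepancy(prompted, plain):
    """Mean squared difference between prompted and plain pixels.

    Assumed form: the abstract only says the term describes the variance
    between prompted and plain images.
    """
    if len(prompted) != len(plain):
        raise ValueError("prompted and plain images must have the same size")
    return sum((p - q) ** 2 for p, q in zip(prompted, plain)) / len(plain)


def joint_loss(logits, label, prompted, plain, lam=0.1):
    """Classification loss plus lam times discrepancy loss.

    lam is a hypothetical weighting factor, not a value from the paper.
    """
    return cross_entropy(logits, label) + lam * discrepancy(prompted, plain)


# Toy example: a 3-pixel "image", its prompted version, and 2-class logits
# produced by a linear probe on top of a frozen backbone.
plain_image = [0.2, 0.5, 0.8]
prompted_image = [0.3, 0.5, 0.7]
loss = joint_loss([2.0, 0.5], 0, prompted_image, plain_image)
```

In a real training loop the prompt and the linear head would be the only trainable parameters, with the backbone kept frozen, and both terms would be averaged over a batch before backpropagation.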