With the popularity of foundation models, parameter-efficient fine-tuning has become the de facto approach to adapting pretrained models to downstream tasks. Taking inspiration from recent advances in large language models, Visual Prompt Tuning and similar techniques learn additional prompt tokens to efficiently fine-tune a pretrained vision foundation model. However, we observe that such prompting is insufficient for fine-grained visual classification tasks such as medical image classification, where inter-class variance is small and intra-class variance is large. Hence, in this paper we propose to leverage the advanced segmentation capabilities of the Segment Anything Model 2 (SAM2) as a visual prompting cue for the visual encoder of CLIP (Contrastive Language-Image Pretraining), guiding its attention to relevant regions of the image. This helps the model focus on highly discriminative regions without being distracted by visually similar background features, an essential requirement in a few-shot, fine-grained classification setting. We evaluate our method on diverse medical datasets spanning X-rays, CT scans, and MRI images, and report accuracies of (71%, 81%, 86%, 58%) with the proposed approach on the (COVID, lung-disease, brain-tumor, breast-cancer) datasets, against (66%, 70%, 68%, 29%) for a pretrained CLIP model after few-shot training. The proposed approach also yields interpretable explanations for the classification decisions through the localization obtained from segmentation.
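The mask-guided attention idea described in the abstract can be sketched as follows. This is an illustrative toy, not the paper's implementation: it assumes a binary segmentation mask (e.g., produced by SAM2) has already been downsampled to the visual encoder's patch grid, and simply adds a bias to single-head attention logits so that attention mass concentrates on foreground (region-of-interest) patches. The function name, tensor shapes, and `bias_strength` parameter are all assumptions for the sketch.

```python
import torch

def mask_guided_attention(q, k, v, patch_mask, bias_strength=4.0):
    """Toy single-head attention in which a binary patch mask
    (1 = region of interest) biases attention toward foreground patches.
    q, k, v: (num_patches, dim); patch_mask: (num_patches,) of {0., 1.}.
    """
    d = q.shape[-1]
    logits = (q @ k.T) / d ** 0.5                  # (P, P) attention logits
    logits = logits + bias_strength * patch_mask   # boost foreground keys
    attn = logits.softmax(dim=-1)                  # rows sum to 1
    return attn @ v, attn

# Tiny demo: 4 patches, patches 0-1 are foreground (e.g., a lesion region).
torch.manual_seed(0)
q, k, v = torch.randn(4, 8), torch.randn(4, 8), torch.randn(4, 8)
mask = torch.tensor([1., 1., 0., 0.])
out, attn = mask_guided_attention(q, k, v, mask)
```

With a sufficiently large `bias_strength`, the foreground columns of `attn` receive most of the attention mass, which is the intended effect of using segmentation as a visual prompting cue.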