With the popularity of foundation models, parameter-efficient fine-tuning has become the de facto approach to adapting pretrained models to downstream tasks. Taking inspiration from recent advances in large language models, Visual Prompt Tuning and similar techniques learn additional prompt tokens to efficiently fine-tune a pretrained vision foundation model. However, we observe that such prompting is insufficient for fine-grained visual classification tasks such as medical image classification, where inter-class variance is small and intra-class variance is large. Hence, in this paper we propose to leverage the advanced segmentation capabilities of the Segment Anything Model 2 (SAM2) as a visual prompting cue for the visual encoder of CLIP (Contrastive Language-Image Pretraining), guiding its attention to relevant regions of the image. This helps the model focus on highly discriminative regions without being distracted by visually similar background features, an essential requirement in a few-shot, fine-grained classification setting. We evaluate our method on diverse medical datasets spanning X-rays, CT scans, and MRI images, and report accuracies of (71%, 81%, 86%, 58%) with the proposed approach on the (COVID, lung-disease, brain-tumor, breast-cancer) datasets, against (66%, 70%, 68%, 29%) for a pretrained CLIP model after few-shot training. The proposed approach also yields interpretable explanations for the classification decisions through the localization obtained from segmentation.
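The mask-guided attention idea described in the abstract can be sketched as follows. This is an illustrative toy, not the paper's implementation: it assumes a binary segmentation mask (e.g., produced by SAM2) has already been downsampled to the visual encoder's patch grid, and simply adds a bias to single-head attention logits so that attention mass concentrates on foreground (region-of-interest) patches. The function name, tensor shapes, and `bias_strength` parameter are all assumptions for the sketch.

```python
import torch

def mask_guided_attention(q, k, v, patch_mask, bias_strength=4.0):
    """Toy single-head attention in which a binary patch mask
    (1 = region of interest) biases attention toward foreground patches.
    q, k, v: (num_patches, dim); patch_mask: (num_patches,) of {0., 1.}.
    """
    d = q.shape[-1]
    logits = (q @ k.T) / d ** 0.5                  # (P, P) attention logits
    logits = logits + bias_strength * patch_mask   # boost foreground keys
    attn = logits.softmax(dim=-1)                  # rows sum to 1
    return attn @ v, attn

# Tiny demo: 4 patches, patches 0-1 are foreground (e.g., a lesion region).
torch.manual_seed(0)
q, k, v = torch.randn(4, 8), torch.randn(4, 8), torch.randn(4, 8)
mask = torch.tensor([1., 1., 0., 0.])
out, attn = mask_guided_attention(q, k, v, mask)
```

With a sufficiently large `bias_strength`, the foreground columns of `attn` receive most of the attention mass, which is the intended effect of using segmentation as a visual prompting cue.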