The deep learning field is converging toward general foundation models that can be easily adapted to diverse tasks. While this paradigm shift has become common practice in natural language processing, progress has been slower in computer vision. In this paper, we attempt to address this issue by investigating the transferability of various state-of-the-art foundation models to medical image classification tasks. Specifically, we evaluate five foundation models, namely SAM, SEEM, DINOv2, BLIP, and OpenCLIP, across four well-established medical imaging datasets. We explore different training settings to fully harness the potential of these models. Our study shows mixed results: DINOv2 consistently outperforms the standard practice of ImageNet pretraining, whereas the other foundation models fail to consistently beat this established baseline, indicating limitations in their transferability to medical image classification tasks.