Convolutional Neural Networks (CNNs) have reigned for a decade as the de facto approach to automated medical image diagnosis, pushing the state of the art in classification, detection, and segmentation tasks. In recent years, vision transformers (ViTs) have emerged as a competitive alternative to CNNs, yielding impressive performance in the natural image domain while possessing several interesting properties that could prove beneficial for medical imaging tasks. In this work, we explore the benefits and drawbacks of transformer-based models for medical image classification. We conduct a series of experiments on several standard 2D medical image benchmark datasets and tasks. Our findings show that, while CNNs perform better when trained from scratch, off-the-shelf vision transformers perform on par with CNNs when pretrained on ImageNet, in both supervised and self-supervised settings, making them a viable alternative to CNNs.