Convolutional Neural Networks (CNNs) have reigned for a decade as the de facto approach to automated medical image diagnosis, pushing the state of the art in classification, detection, and segmentation tasks. In recent years, vision transformers (ViTs) have emerged as a competitive alternative to CNNs, yielding impressive performance in the natural image domain while possessing several properties that could prove beneficial for medical imaging tasks. In this work, we explore the benefits and drawbacks of transformer-based models for medical image classification. We conduct a series of experiments on several standard 2D medical image benchmark datasets and tasks. Our findings show that, while CNNs perform better when trained from scratch, off-the-shelf vision transformers can perform on par with CNNs when pretrained on ImageNet, in both supervised and self-supervised settings, making them a viable alternative to CNNs.