Convolutional Neural Networks (CNNs) have reigned for a decade as the de facto approach to automated medical image diagnosis, pushing the state of the art in classification, detection, and segmentation tasks. In recent years, vision transformers (ViTs) have emerged as a competitive alternative to CNNs, achieving impressive performance in the natural image domain while possessing several properties that could prove beneficial for medical imaging tasks. In this work, we explore the benefits and drawbacks of transformer-based models for medical image classification. We conduct a series of experiments on several standard 2D medical image benchmark datasets and tasks. Our findings show that, while CNNs perform better when trained from scratch, off-the-shelf vision transformers pretrained on ImageNet can perform on par with CNNs in both supervised and self-supervised settings, rendering them a viable alternative to CNNs.