Although purely transformer-based architectures have shown promising performance in many computer vision tasks, many hybrid models combining CNN and transformer blocks have been introduced to fit more specialized tasks. Nevertheless, despite the performance gains of both pure and hybrid transformer-based architectures over CNNs in medical image segmentation, their high training cost and complexity make them challenging to use in real-world scenarios. In this work, we propose simple architectures based purely on convolutional layers, and show that by just taking advantage of the attention map visualizations obtained from a self-supervised pretrained vision transformer network (e.g., DINO), one can outperform complex transformer-based networks at a much lower computational cost. The proposed architecture is composed of two encoder branches: one takes the original image as input, and the other takes the attention map visualizations of the same image, obtained from multiple self-attention heads of a pretrained DINO model and stacked as multiple channels. Experiments on two publicly available medical imaging datasets show that the proposed pipeline outperforms U-Net and state-of-the-art medical image segmentation models.
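The input to the second encoder branch, described above as attention maps from multiple self-attention heads stacked as channels, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each head supplies one attention weight per image patch (for example, the attention from the class token to the patches), that each map is min-max normalized, and that nearest-neighbour upsampling restores the image resolution. The abstract confirms none of these choices. The names `head_attention_to_map` and `stack_heads` are hypothetical, and extracting the attention from a DINO model is not shown.

```python
from typing import List

Grid = List[List[float]]  # one channel, indexed as grid[y][x]


def head_attention_to_map(cls_attention: List[float], grid_h: int,
                          grid_w: int, patch_size: int) -> Grid:
    """Turn one head's per-patch attention vector into an image-sized map.

    Assumption (not from the source): min-max normalization followed by
    nearest-neighbour upsampling by a factor of patch_size.
    """
    if len(cls_attention) != grid_h * grid_w:
        raise ValueError("expected exactly one attention weight per patch")

    lo, hi = min(cls_attention), max(cls_attention)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant map
    norm = [(a - lo) / span for a in cls_attention]

    # Reshape the flat patch sequence into a row-major patch grid.
    patch_rows = [norm[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]

    # Nearest-neighbour upsampling: repeat every patch value patch_size
    # times horizontally, then repeat every widened row patch_size times.
    full: Grid = []
    for row in patch_rows:
        wide = [v for v in row for _ in range(patch_size)]
        full.extend(list(wide) for _ in range(patch_size))
    return full


def stack_heads(per_head_attention: List[List[float]], grid_h: int,
                grid_w: int, patch_size: int) -> List[Grid]:
    """Build a multi-channel input with one channel per attention head."""
    return [head_attention_to_map(a, grid_h, grid_w, patch_size)
            for a in per_head_attention]


# Toy values (made up): 2 heads, a 2x2 patch grid, 2x2-pixel patches.
toy_attention = [[0.1, 0.2, 0.3, 0.4],
                 [0.4, 0.4, 0.1, 0.1]]
attention_channels = stack_heads(toy_attention, grid_h=2, grid_w=2, patch_size=2)
```

Here `attention_channels` has one 4x4 channel per head and would be fed to the second convolutional encoder, while the original image goes to the first. A real pipeline would typically do this with tensor operations rather than nested lists.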