3D medical imaging technologies such as MRI and CT are widely used in diagnosing and treating diverse diseases, which makes 3D segmentation one of the fundamental tasks of medical image analysis. Recently, Transformer-based models have begun to achieve state-of-the-art performance across many vision tasks through pre-training on large-scale natural image benchmark datasets. Although work on medical image analysis has also begun to explore Transformer-based models, there is currently no optimal strategy for effectively leveraging pre-trained Transformers, primarily because of the difference in dimensionality between 2D natural images and 3D medical images. Existing solutions either split 3D images into 2D slices and predict each slice independently, thereby losing crucial depth-wise information, or modify the Transformer architecture to support 3D inputs without leveraging pre-trained weights. In this work, we use a simple yet effective weight inflation strategy to adapt pre-trained Transformers from 2D to 3D, retaining the benefits of both transfer learning and depth information. We further investigate the effectiveness of transfer from different pre-training sources and objectives. Our approach achieves state-of-the-art performance across a broad range of 3D medical image datasets and can serve as a standard strategy, easily adopted by any work on Transformer-based models for 3D medical images, to maximize performance.
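To make the idea of weight inflation concrete, the following is a minimal sketch of how a pre-trained 2D patch-embedding kernel might be expanded along a new depth axis. The abstract does not specify the inflation rule, so both variants shown here (averaging across depth, and placing the 2D weights in the central depth slice) are assumptions for illustration; the function name `inflate_kernel` and the tensor shapes are hypothetical, and NumPy stands in for a deep learning framework.

```python
import numpy as np


def inflate_kernel(w2d, depth, mode="average"):
    """Inflate a 2D kernel (out, in, kh, kw) to 3D (out, in, depth, kh, kw).

    Illustrative only: the two modes are assumed inflation rules, not
    necessarily the ones used by the authors.
    """
    if depth < 1:
        raise ValueError("depth must be a positive integer")
    if mode == "average":
        # Copy the 2D weights to every depth slice and rescale, so the
        # weights summed over depth equal the original 2D kernel.
        return np.repeat(w2d[:, :, None, :, :], depth, axis=2) / depth
    if mode == "center":
        # Put the 2D weights in the middle depth slice, zeros elsewhere.
        shape = w2d.shape[:2] + (depth,) + w2d.shape[2:]
        w3d = np.zeros(shape, dtype=w2d.dtype)
        w3d[:, :, depth // 2] = w2d
        return w3d
    raise ValueError(f"unknown mode: {mode!r}")


# Stand-in for pre-trained weights: 8 filters, 1 channel, 4x4 patches.
rng = np.random.default_rng(0)
w2d = rng.standard_normal((8, 1, 4, 4))
w3d = inflate_kernel(w2d, depth=4, mode="average")

# A 3D patch that is constant along depth should give the same response
# under the inflated kernel as the 2D patch gives under the 2D kernel.
patch2d = rng.standard_normal((1, 4, 4))            # (in, h, w)
patch3d = np.repeat(patch2d[:, None], 4, axis=1)    # (in, d, h, w)
resp2d = np.einsum("oihw,ihw->o", w2d, patch2d)
resp3d = np.einsum("oidhw,idhw->o", w3d, patch3d)
assert np.allclose(resp2d, resp3d)
```

Under either rule, the inflated kernel reproduces the 2D response on a volume that does not vary along depth, so the pre-trained behaviour is a reasonable starting point before fine-tuning on 3D data.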