Recent large-scale Vision Transformers have substantially improved pre-trained models for medical image segmentation. However, these methods depend on large amounts of pre-training data, which are difficult to acquire, particularly in the medical field. To address this limitation, we present Masked Multi-view with Swin Transformers (SwinMM), a novel multi-view pipeline for accurate and data-efficient self-supervised medical image analysis. Our strategy exploits multi-view information through two principal components. In the pre-training phase, we deploy a masked multi-view encoder that is trained concurrently on masked multi-view observations through a range of proxy tasks: image reconstruction, rotation, contrastive learning, and a novel task that employs a mutual learning paradigm. This new task capitalizes on the consistency between predictions from different views, enabling the extraction of hidden multi-view information from 3D medical data. In the fine-tuning stage, a cross-view decoder aggregates the multi-view information through a cross-attention block. Compared with the previous state-of-the-art self-supervised learning method, Swin UNETR, SwinMM demonstrates a clear advantage on several medical image segmentation tasks. It integrates multi-view information smoothly, improving both the accuracy and the data efficiency of the model. Code and models are available at https://github.com/UCSC-VLAA/SwinMM/.
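To make the mutual learning idea concrete, the following is a minimal sketch of a cross-view consistency loss. It is not the SwinMM implementation. The view definitions (`VIEWS`), the two toy predictors, and the choice of a symmetric KL divergence are all assumptions made for illustration; the abstract only states that the task relies on consistency between predictions from different views. The sketch permutes the axes of a 3D volume to form two views, maps each prediction back to a common orientation, and penalizes disagreement between the aligned predictions.

```python
import numpy as np

# Hypothetical view definitions: each view is a permutation of the (D, H, W) axes.
VIEWS = {"axial": (0, 1, 2), "coronal": (1, 0, 2), "sagittal": (2, 1, 0)}


def to_view(volume, perm):
    """Permute the spatial axes of a (D, H, W) volume to form a view."""
    return np.transpose(volume, perm)


def from_view(pred, perm):
    """Map a (C, ...) prediction made on a view back to the (C, D, H, W) layout."""
    inverse = np.argsort(perm)
    return np.transpose(pred, (0,) + tuple(int(a) + 1 for a in inverse))


def softmax(logits):
    """Numerically stable softmax over the class axis (axis 0)."""
    shifted = logits - logits.max(axis=0, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)


def kl_divergence(p, q, eps=1e-8):
    """Mean per-voxel KL(p || q), summing over the class axis."""
    per_voxel = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=0)
    return float(per_voxel.mean())


def mutual_consistency_loss(probs_a, probs_b):
    """Symmetric KL between two aligned predictions (an assumed loss choice)."""
    return 0.5 * (kl_divergence(probs_a, probs_b) + kl_divergence(probs_b, probs_a))


def voxelwise_predictor(view):
    """Toy stand-in for a network: logits depend only on voxel intensity."""
    return np.stack([view, -view])


def view_biased_predictor(view):
    """Toy stand-in whose logits also depend on position along the first view axis."""
    ramp = np.linspace(0.0, 1.0, view.shape[0]).reshape(-1, 1, 1)
    return np.stack([view + ramp, -view])


def multi_view_loss(volume, predictor, view_a="axial", view_b="coronal"):
    """Predict on two views, align both to the original layout, compare them."""
    aligned = []
    for name in (view_a, view_b):
        perm = VIEWS[name]
        logits = predictor(to_view(volume, perm))
        aligned.append(softmax(from_view(logits, perm)))
    return mutual_consistency_loss(aligned[0], aligned[1])


rng = np.random.default_rng(0)
volume = rng.normal(size=(4, 5, 6))
consistent = multi_view_loss(volume, voxelwise_predictor)
inconsistent = multi_view_loss(volume, view_biased_predictor)
```

In this toy setting, a predictor that depends only on voxel intensity agrees with itself across views, so its loss vanishes, while a predictor that depends on the view orientation incurs a positive loss. Minimizing such a term pushes a model toward view-consistent predictions, which is the intuition the abstract describes.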