The Vision Transformer (ViT) has demonstrated remarkable performance in Self-Supervised Learning (SSL) for 3D medical image analysis. Masked AutoEncoder (MAE) pre-training can further unleash the potential of ViT on various medical vision tasks. However, due to the large spatial size and high dimensionality of 3D medical images, the lack of a hierarchical design in MAE may limit performance on downstream tasks. In this paper, we propose a novel \textit{Mask in Mask (MiM)} pre-training framework for 3D medical images, which advances MAE by learning discriminative representations from hierarchical visual tokens across multiple scales. We introduce masked inputs at multiple levels of granularity from the volume, which are then reconstructed simultaneously at both fine and coarse levels. Additionally, a cross-level alignment mechanism is applied to volumes at adjacent levels to enforce anatomical similarity hierarchically. Furthermore, we adopt a hybrid backbone to enhance hierarchical representation learning efficiently during pre-training. MiM was pre-trained on a large-scale collection of available 3D volumetric images, \textit{i.e.,} Computed Tomography (CT) scans covering various body parts. Extensive experiments on thirteen public datasets demonstrate the superiority of MiM over other SSL methods in organ/lesion/tumor segmentation and disease classification. We further scale up MiM to large pre-training datasets with more than 10k volumes, showing that large-scale pre-training can further improve downstream performance. These improvements suggest that the research community should pay more attention to the scale of pre-training datasets when building healthcare foundation models for 3D medical images.