The Masked Autoencoder (MAE) has recently been shown to be effective for pre-training Vision Transformers (ViTs) for natural image analysis. By reconstructing full images from partially masked inputs, a ViT encoder learns to aggregate contextual information to infer the masked image regions. We believe that this context aggregation ability is particularly essential in the medical image domain, where each anatomical structure is functionally and mechanically connected to other structures and regions. Because there is no ImageNet-scale medical image dataset for pre-training, we investigate a self pre-training paradigm with MAE for medical image analysis tasks. Our method pre-trains a ViT on the training set of the target data instead of on another dataset. Self pre-training can therefore benefit more scenarios in which pre-training data is hard to acquire. Our experimental results show that MAE self pre-training markedly improves performance on diverse medical image analysis tasks, including chest X-ray disease classification, abdominal CT multi-organ segmentation, and MRI brain tumor segmentation. Code is available at https://github.com/cvlab-stonybrook/SelfMedMAE
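To make the masked-reconstruction idea concrete, the following is a minimal sketch in plain Python, not the code from the repository above. It assumes an image has already been split into a flat list of patches, hides a random subset of them, and scores a reconstruction only on the hidden patches, as in the original MAE formulation. The function names `random_patch_mask` and `masked_mse`, the default `mask_ratio` of 0.75, and the list-of-lists patch representation are illustrative assumptions, not values or APIs taken from this work.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Randomly split patch indices into visible and masked sets.

    mask_ratio=0.75 is an assumed default for illustration only.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(indices[:num_masked])    # hidden from the encoder
    visible = sorted(indices[num_masked:])   # the only patches the encoder sees
    return visible, masked


def masked_mse(target, prediction, masked):
    """Mean squared error computed over the masked patches only.

    target and prediction are lists of patches; each patch is a list of
    pixel values of the same length.
    """
    if not masked:
        raise ValueError("at least one masked patch is required")
    total = 0.0
    count = 0
    for i in masked:
        for t, p in zip(target[i], prediction[i]):
            total += (t - p) ** 2
            count += 1
    return total / count
```

In a self pre-training setup of the kind described above, the encoder would receive only the `visible` patches of each training image from the target dataset, a decoder would predict every patch, and a loss such as `masked_mse` would drive the reconstruction of the `masked` ones.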