The research community has witnessed the strong potential of self-supervised Masked Image Modeling (MIM), which enables models to learn visual representations from unlabeled data. In this paper, to incorporate both the global structural information and the local details that are crucial for dense prediction tasks, we shift the perspective to the frequency domain and present FreMAE, a new MIM-based framework for self-supervised pre-training in medical image segmentation. Based on the observation that detailed structural information lies mainly in the high-frequency components while high-level semantics are abundant in the low-frequency components, we further incorporate multi-stage supervision to guide representation learning during the pre-training phase. Extensive experiments on three benchmark datasets show the advantage of the proposed FreMAE over previous state-of-the-art MIM methods. Compared with various baselines trained from scratch, FreMAE consistently brings considerable improvements in model performance. To the best of our knowledge, this is the first attempt at MIM with the Fourier transform in medical image segmentation.
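To make the high-frequency versus low-frequency observation concrete, the following is a minimal sketch of splitting a 2D image into two bands with the Fourier transform. It is an illustration only and not the FreMAE implementation: the function name `split_frequency_bands`, the circular cutoff, and the `radius` parameter are all assumptions introduced here.

```python
import numpy as np


def split_frequency_bands(image, radius):
    """Split a 2D image into low- and high-frequency parts.

    Hypothetical helper for illustration; the circular cutoff of
    `radius` frequency bins is an assumption, not the paper's design.
    """
    h, w = image.shape
    # Move the zero-frequency (DC) component to index (h // 2, w // 2).
    spectrum = np.fft.fftshift(np.fft.fft2(image))

    # Distance of every frequency bin from the DC component.
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    low_mask = dist <= radius

    # Keep one band at a time and transform back to the image domain.
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * ~low_mask)).real
    return low, high


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((64, 64))  # stand-in for a medical image slice
    low, high = split_frequency_bands(image, radius=8)
    # The two masks are complementary, so the bands sum back to the input.
    print(np.allclose(low + high, image))
```

Under this sketch, the low band holds the smooth, coarse content of the image and the high band holds edges and fine texture, which is the split that the observation above relies on.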