Self-supervised Masked Image Modeling (MIM) has shown strong potential for learning visual representations from unlabeled data. In this paper, to incorporate both the global structural information and the local details that are crucial for dense prediction tasks, we shift the perspective to the frequency domain and present FreMIM, a new MIM-based self-supervised pre-training framework for medical image segmentation. Based on the observation that detailed structural information lies mainly in the high-frequency components while high-level semantics are abundant in their low-frequency counterparts, we further incorporate multi-stage supervision to guide representation learning during pre-training. Extensive experiments on three benchmark datasets show that FreMIM outperforms previous state-of-the-art MIM methods. Compared with various baselines trained from scratch, FreMIM consistently brings considerable improvements in model performance. The code will be made publicly available.
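
To make the frequency-domain observation concrete, the following is a minimal sketch of splitting a 2D image into low- and high-frequency parts with a 2D FFT. It is not the FreMIM implementation: the function name `split_frequency_bands`, the `radius` cutoff, and the ideal circular low-pass mask are all illustrative assumptions chosen for brevity.

```python
import numpy as np


def split_frequency_bands(image: np.ndarray, radius: float):
    """Split a 2D image into low- and high-frequency parts.

    Hypothetical helper for illustration only; the ideal circular mask
    and the `radius` cutoff are assumptions, not the paper's design.
    """
    h, w = image.shape
    # Move the zero-frequency (DC) term to the center of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(image))

    # Distance of every frequency bin from the spectrum center.
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    low_mask = dist <= radius

    # Keep bins inside the circle for the low band, the rest for the high band.
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * ~low_mask)).real
    return low, high


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((64, 64))
    low, high = split_frequency_bands(img, radius=8.0)
    # The two bands are complementary, so they sum back to the input.
    print(np.allclose(low + high, img))
```

In this sketch the low band retains the smooth, coarse content of the image, while the high band retains edges and fine texture. Because the two masks are complementary, the bands sum back to the original image.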