We present Masked Frequency Modeling (MFM), a unified frequency-domain approach to self-supervised pre-training of visual models. Instead of randomly inserting mask tokens into the input embeddings in the spatial domain, we shift the perspective to the frequency domain. Specifically, MFM first masks out a portion of the frequency components of the input image and then predicts the missing frequencies on the frequency spectrum. Our key insight is that, because of heavy spatial redundancy, predicting masked components in the frequency domain is better suited to revealing underlying image patterns than predicting masked patches in the spatial domain. Our findings suggest that, with the right configuration of the mask-and-predict strategy, both the structural information within high-frequency components and the low-level statistics among their low-frequency counterparts are useful for learning good representations. For the first time, MFM demonstrates that, for both ViT and CNN, a simple non-Siamese framework can learn meaningful representations using none of the following: (i) extra data, (ii) an extra model, (iii) a mask token. Experimental results on image classification and semantic segmentation, as well as on several robustness benchmarks, show that MFM achieves competitive performance and strong robustness compared with recent masked image modeling approaches. Furthermore, we comprehensively investigate the effectiveness of classical image restoration tasks for representation learning from a unified frequency perspective and reveal their intriguing relations with MFM.
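To make the mask-and-predict idea concrete, the following is a minimal NumPy sketch of masking frequency components and scoring a reconstruction on the missing frequencies only. It is an illustration under stated assumptions, not the method's actual implementation: the circular mask, the `radius` value, the grayscale input, and the helper names `radial_mask`, `mask_frequencies`, and `masked_frequency_loss` are all hypothetical choices that the abstract does not specify.

```python
import numpy as np


def radial_mask(height, width, radius):
    """Boolean array that is True inside a circle centered on the shifted spectrum."""
    ys, xs = np.ogrid[:height, :width]
    cy, cx = height // 2, width // 2
    return (ys - cy) ** 2 + (xs - cx) ** 2 <= radius ** 2


def mask_frequencies(image, radius, keep="low"):
    """Zero out part of the centered 2D spectrum of a grayscale image.

    Returns the corrupted image (what a model would receive as input) and a
    boolean array marking the removed frequencies (what it would predict).
    The circular low/high split is an assumption made for this sketch.
    """
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    inside = radial_mask(image.shape[0], image.shape[1], radius)
    kept = inside if keep == "low" else ~inside
    corrupted = np.fft.ifft2(np.fft.ifftshift(spectrum * kept)).real
    return corrupted, ~kept


def masked_frequency_loss(prediction, target, removed):
    """Mean spectral distance between prediction and target on removed frequencies only."""
    diff = np.fft.fftshift(np.fft.fft2(prediction) - np.fft.fft2(target))
    return float(np.abs(diff[removed]).mean())


# Toy usage: a random array stands in for an input image.
rng = np.random.default_rng(0)
image = rng.random((32, 32))
corrupted, removed = mask_frequencies(image, radius=8, keep="low")
baseline_loss = masked_frequency_loss(corrupted, image, removed)
```

Here `corrupted` stands in for the model input and `removed` marks the part of the spectrum to be predicted. Passing `keep="high"` instead removes the low frequencies, which corresponds to the other half of the high-frequency versus low-frequency comparison mentioned above.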