Self-supervised visual pre-training methods face an inherent tension: contrastive learning (CL) captures global semantics but loses fine-grained detail, while masked image modeling (MIM) preserves local textures but suffers from "attention drift" due to semantically agnostic random masking. We propose C2FMAE, a coarse-to-fine masked autoencoder that resolves this tension by explicitly learning hierarchical visual representations across three data granularities: semantic masks (scene-level), instance masks (object-level), and RGB images (pixel-level). Two synergistic innovations enforce a strict top-down learning principle. First, a cascaded decoder sequentially reconstructs from scene semantics to object instances to pixel details, establishing explicit cross-granularity dependencies that parallel decoders cannot capture. Second, a progressive masking curriculum dynamically shifts the training focus from semantic-guided to instance-guided and finally to random masking, creating a structured learning path from global context to local features. To support this framework, we construct a large-scale multi-granular dataset with high-quality pseudo-labels for all 1.28M ImageNet-1K images. Extensive experiments show that C2FMAE achieves significant performance gains on image classification, object detection, and semantic segmentation, validating the effectiveness of our hierarchical design in learning more robust and generalizable representations.
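The progressive masking curriculum can be sketched as a schedule that mixes the three masking strategies over training. The piecewise-linear transition and the function name `masking_strategy_weights` below are illustrative assumptions; the abstract does not specify the exact schedule.

```python
def masking_strategy_weights(progress: float) -> dict:
    """Return sampling weights over the three masking strategies.

    progress: fraction of pre-training completed, in [0, 1].
    The curriculum moves semantic-guided -> instance-guided -> random,
    here modeled as two linear cross-fades (an assumed schedule).
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    if progress <= 0.5:
        # First half: fade from semantic-guided to instance-guided masking.
        t = progress / 0.5
        return {"semantic": 1.0 - t, "instance": t, "random": 0.0}
    # Second half: fade from instance-guided to random masking.
    t = (progress - 0.5) / 0.5
    return {"semantic": 0.0, "instance": 1.0 - t, "random": t}
```

At each pre-training step, one strategy would be sampled according to these weights, so early epochs are dominated by semantic-guided masking and late epochs approach plain random masking as in a standard MAE.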