Vision-specific concepts such as "region" have played a key role in extending general machine learning frameworks to tasks like object detection. Given the success of region-based detectors for supervised learning and the progress of intra-image methods for contrastive learning, we explore the use of regions for reconstructive pre-training. Starting from Masked Autoencoding (MAE) as both a baseline and an inspiration, we propose a parallel pretext task tailored to address the one-to-many mapping between images and regions. Since such regions can be generated in an unsupervised way, our approach (R-MAE) inherits the wide applicability of MAE while being more "region-aware". We conduct thorough analyses during the development of R-MAE and converge on a variant that is both effective and efficient (1.3% overhead over MAE). Moreover, it shows consistent quantitative improvements when generalized to various pre-training data and downstream detection and segmentation benchmarks. Finally, we provide extensive qualitative visualizations to enhance the understanding of R-MAE's behaviour and potential. Code will be made available at https://github.com/facebookresearch/r-mae.