In this work, we explore regions as a potential visual analogue of words for self-supervised image representation learning. Inspired by Masked Autoencoding (MAE), a generative pre-training baseline, we propose masked region autoencoding to learn from regions, that is, groups of pixels. Specifically, we design an architecture that efficiently handles the one-to-many mapping between images and regions and is especially effective when the regions are of high quality. When integrated with MAE, our approach (R-MAE) yields consistent improvements across various pre-training datasets and downstream detection and segmentation benchmarks, with negligible computational overhead. Beyond the quantitative evaluation, our analysis indicates that models pre-trained with masked region autoencoding unlock the potential for interactive segmentation. The code is provided at https://github.com/facebookresearch/r-mae.
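To make the one-to-many mapping between an image and its regions concrete, the following is a minimal sketch of how inputs for masked region autoencoding might be prepared. It is not the released R-MAE code. The function names (`patchify`, `shared_random_mask`, `masked_region_inputs`), the representation of each region as a binary map, the patch size of 16, the mask ratio of 0.75, and the choice to share one random mask between the pixels and all region maps are assumptions made for illustration only.

```python
import numpy as np


def patchify(x, patch):
    """Split an (H, W, C) array into (num_patches, patch * patch * C) rows."""
    h, w, c = x.shape
    gh, gw = h // patch, w // patch
    x = x[: gh * patch, : gw * patch]  # drop any remainder at the border
    x = x.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)


def shared_random_mask(num_patches, mask_ratio, rng):
    """Boolean vector over patches; True marks a patch hidden from the encoder."""
    num_hidden = int(round(num_patches * mask_ratio))
    hidden = np.zeros(num_patches, dtype=bool)
    hidden[rng.permutation(num_patches)[:num_hidden]] = True
    return hidden


def masked_region_inputs(image, regions, patch=16, mask_ratio=0.75, seed=0):
    """Prepare visible inputs and reconstruction targets (illustrative only).

    image:   (H, W, 3) float array.
    regions: list of K binary maps of shape (H, W), one per region.
             One image maps to many regions, hence the leading K axis below.
    """
    rng = np.random.default_rng(seed)
    pixel_patches = patchify(image, patch)  # (N, patch * patch * 3)
    region_patches = np.stack(
        [patchify(r[..., None], patch) for r in regions]
    )  # (K, N, patch * patch)

    # Assumption: the same patches are hidden for the pixels and every region.
    hidden = shared_random_mask(pixel_patches.shape[0], mask_ratio, rng)
    return {
        "visible_pixels": pixel_patches[~hidden],
        "visible_regions": region_patches[:, ~hidden],
        "region_targets": region_patches[:, hidden],
        "hidden": hidden,
    }
```

In this sketch, a model would receive the visible pixel and region patches and be trained to reconstruct `region_targets`, the hidden parts of each binary region map. How the regions are encoded and decoded efficiently is the contribution of the architecture itself and is not modeled here.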