Learning representations through self-supervision on unlabeled data has proven highly effective for understanding diverse images. However, remote sensing images often contain complex, densely populated scenes with many land objects and no clear foreground object. This intrinsic property leads to high object density, which produces false positive pairs or discards contextual information in self-supervised learning. To address these problems, we propose a context-enhanced masked image modeling method (CtxMIM), a simple yet efficient MIM-based self-supervised learning method for remote sensing image understanding. CtxMIM formulates the original image patches as a reconstructive template and employs a Siamese framework to operate on two sets of image patches. A context-enhanced generative branch is introduced to provide contextual information through context consistency constraints during reconstruction. With this simple and elegant design, CtxMIM encourages the pre-trained model to learn object-level and pixel-level features on a large-scale dataset without specific temporal or geographical constraints. Finally, extensive experiments show that features learned by CtxMIM outperform fully supervised and state-of-the-art self-supervised learning methods on various downstream tasks, including land cover classification, semantic segmentation, object detection, and instance segmentation. These results demonstrate that CtxMIM learns impressive remote sensing representations with high generalization and transferability. Code and data will be made publicly available.
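Since the abstract describes the architecture only at a high level, the following is a minimal PyTorch sketch of the general idea it names: a shared (Siamese) encoder over a masked view and the full set of patches, pixel reconstruction against the original patches as the template, and a context consistency term between the two branches. Everything here is an illustrative assumption, not the authors' implementation: the class name CtxMIMSketch, the MLP encoder standing in for a real backbone, the mask ratio, and the exact form of the consistency loss are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CtxMIMSketch(nn.Module):
    """Illustrative sketch: Siamese encoder over masked and original
    patch sets, pixel reconstruction of the original patches, and a
    context consistency term between the two branches (assumed form)."""

    def __init__(self, patch_dim: int = 192, dim: int = 256, mask_ratio: float = 0.6):
        super().__init__()
        # Shared (Siamese) encoder; a real model would use a ViT-style backbone.
        self.encoder = nn.Sequential(
            nn.Linear(patch_dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )
        self.decoder = nn.Linear(dim, patch_dim)  # predicts raw patch pixels
        self.mask_token = nn.Parameter(torch.zeros(1, 1, patch_dim))
        self.mask_ratio = mask_ratio

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (B, N, patch_dim) flattened image patches; the originals
        # serve as the reconstructive template (i.e., the targets).
        B, N, D = patches.shape
        mask = (torch.rand(B, N, device=patches.device) < self.mask_ratio).float()

        # Branch 1: masked view, with masked positions replaced by a mask token.
        masked_in = torch.where(
            mask.unsqueeze(-1).bool(), self.mask_token.expand(B, N, D), patches
        )
        z_masked = self.encoder(masked_in)

        # Branch 2 (context branch): the same encoder on the full patch set.
        z_ctx = self.encoder(patches)

        # Pixel reconstruction loss, computed on masked positions only.
        rec = self.decoder(z_masked)
        loss_rec = (F.mse_loss(rec, patches, reduction="none").mean(-1)
                    * mask).sum() / mask.sum().clamp(min=1)

        # Context consistency constraint (assumed form): masked-branch
        # features should agree with context-branch features at masked spots.
        loss_ctx = (F.mse_loss(z_masked, z_ctx.detach(), reduction="none").mean(-1)
                    * mask).sum() / mask.sum().clamp(min=1)
        return loss_rec + loss_ctx

if __name__ == "__main__":
    model = CtxMIMSketch()
    loss = model(torch.randn(4, 196, 192))  # 4 images, 196 patches each
    loss.backward()
    print(float(loss))
```

Detaching the context branch is one plausible way to keep the consistency term from collapsing both branches toward a trivial solution; the paper's actual constraint may be formulated differently.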