Learning representations through self-supervision on unlabeled data has proven highly effective for understanding diverse images. However, remote sensing images often contain complex, densely populated scenes with multiple land objects and no clear foreground object. The resulting high object density leads to false positive pairs or missing contextual information in self-supervised learning. To address these problems, we propose a context-enhanced masked image modeling method (CtxMIM), a simple yet efficient MIM-based self-supervised learning method for remote sensing image understanding. CtxMIM treats the original image patches as a reconstruction template and employs a Siamese framework that operates on two sets of image patches. A context-enhanced generative branch is introduced to provide contextual information through context consistency constraints during reconstruction. With this simple design, CtxMIM encourages the pre-trained model to learn object-level or pixel-level features on a large-scale dataset without specific temporal or geographical constraints. Extensive experiments show that features learned by CtxMIM outperform fully supervised and state-of-the-art self-supervised learning methods on various downstream tasks, including land cover classification, semantic segmentation, object detection, and instance segmentation. These results demonstrate that CtxMIM learns remote sensing representations with high generalization and transferability. Code and data will be made publicly available.
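The abstract describes the mechanism only at a high level, so the following is a minimal NumPy sketch of the general idea rather than the authors' implementation. It assumes a generic masked-image-modeling setup: an image is split into patches, a random subset is masked, the original patches serve as the reconstruction target, and a second term compares features from a masked branch and an unmasked branch. All function names, the L1 and squared-error loss forms, the 16-pixel patch size, and the 0.5 mask ratio are hypothetical choices made for illustration; none of them comes from the source.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into (N, patch * patch * C) flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (grid_row, grid_col, patch_row, patch_col, channel)
    return x.reshape(gh * gw, patch * patch * c)


def random_mask(num_patches, mask_ratio, rng):
    """Return a boolean vector where True marks a masked patch."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask


def masked_reconstruction_loss(pred, target, mask):
    """Mean absolute error computed over masked patches only (assumed loss form)."""
    return float(np.abs(pred[mask] - target[mask]).mean())


def context_consistency_loss(feat_masked, feat_full):
    """Mean squared distance between the features of the two branches.

    This is an assumed stand-in for a context consistency constraint,
    not the formulation used in the paper.
    """
    return float(((feat_masked - feat_full) ** 2).mean())


# Toy usage with random data; a real model would replace the zero-filled
# "prediction" below with the output of an encoder-decoder network.
rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))
patches = patchify(image, patch=16)          # shape (16, 768)
mask = random_mask(len(patches), 0.5, rng)   # hypothetical mask ratio
masked_input = np.where(mask[:, None], 0.0, patches)

recon = masked_reconstruction_loss(masked_input, patches, mask)
consistency = context_consistency_loss(masked_input, patches)
total = recon + consistency                  # equal weighting is an assumption
```

In this sketch the unmasked patches play the role of the reconstruction template, and the two inputs (`masked_input` and `patches`) stand in for the two patch sets fed to the Siamese branches.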