CMID: A Unified Self-Supervised Learning Framework for Remote Sensing Image Understanding

Self-supervised learning (SSL) has gained widespread attention in the remote sensing (RS) and earth observation (EO) communities owing to its ability to learn task-agnostic representations without human-annotated labels. Nevertheless, most existing RS SSL methods are limited to learning either globally semantically separable or locally spatially perceptible representations, but not both. We argue that this learning strategy is suboptimal for RS, since different RS downstream tasks often require varied and complex representations. In this study, we propose a unified SSL framework that is better suited to RS image representation learning. The proposed framework, Contrastive Mask Image Distillation (CMID), learns representations with both global semantic separability and local spatial perceptibility by combining contrastive learning (CL) with masked image modeling (MIM) in a self-distillation manner. Furthermore, CMID is architecture-agnostic: it is compatible with both convolutional neural networks (CNN) and vision transformers (ViT), so it can be easily adapted to a variety of deep learning (DL) applications for RS understanding. Comprehensive experiments have been carried out on four downstream tasks (i.e., scene classification, semantic segmentation, object detection, and change detection), and the results show that models pre-trained using CMID achieve better performance than other state-of-the-art SSL methods on multiple downstream tasks. The code and pre-trained models will be made available at https://github.com/NJU-LHRS/official-CMID to facilitate SSL research and speed up the development of DL applications for RS images.
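To make the idea of combining CL with MIM under self-distillation more concrete, the following is a minimal, illustrative sketch and not the CMID implementation. It assumes a generic student/teacher setup in which the teacher is an exponential moving average of the student, an InfoNCE-style contrastive term on global embeddings, and a mean-squared reconstruction term computed on masked patches only. All function names, the momentum, the temperature, and the loss weight are hypothetical choices made for illustration, and plain Python lists stand in for tensors.

```python
import math

def ema_update(teacher, student, momentum=0.99):
    """Move each teacher weight toward the student weight (assumed self-distillation rule)."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

def _normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce(student_embs, teacher_embs, temperature=0.2):
    """Contrastive term: the i-th student embedding should match the i-th teacher embedding."""
    s = [_normalize(v) for v in student_embs]
    t = [_normalize(v) for v in teacher_embs]
    total = 0.0
    for i, si in enumerate(s):
        logits = [sum(a * b for a, b in zip(si, tj)) / temperature for tj in t]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(s)

def masked_mse(pred, target, mask):
    """Reconstruction term averaged over masked patches only (mask[i] == 1)."""
    n_masked = sum(mask)
    if n_masked == 0:
        return 0.0
    return sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask)) / n_masked

def combined_loss(student_embs, teacher_embs, pred, target, mask, weight=1.0):
    """Global (contrastive) term plus weighted local (masked reconstruction) term."""
    return info_nce(student_embs, teacher_embs) + weight * masked_mse(pred, target, mask)
```

In this sketch the contrastive term stands in for global semantic separability and the masked reconstruction term for local spatial perceptibility; how CMID actually defines and balances its objectives is specified in the paper and the released code, not here.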
