The volume of unlabelled Earth observation (EO) data is huge, but many important applications lack labelled training data. However, EO data offers a unique opportunity: data from different modalities and sensors can be paired automatically based on geographic location and time, at virtually no human labour cost. We seize this opportunity to create MMEarth, a diverse multi-modal pretraining dataset at global scale. Using this new corpus of 1.2 million locations, we propose a Multi-Pretext Masked Autoencoder (MP-MAE) approach to learn general-purpose representations for optical satellite images. Our approach builds on ConvNeXt V2, a fully convolutional masked autoencoder (MAE) architecture. Drawing upon a suite of multi-modal pretext tasks, we demonstrate that our MP-MAE approach outperforms both MAEs pretrained on ImageNet and MAEs pretrained on domain-specific satellite images, across several downstream tasks including image classification and semantic segmentation. We find that pretraining with multi-modal pretext tasks notably improves linear probing performance compared to pretraining on optical satellite images only. It also yields better label efficiency and parameter efficiency, both of which are crucial in global-scale applications.
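To make the described setup concrete, below is a minimal, hypothetical PyTorch sketch of the multi-pretext MAE idea: a shared fully convolutional encoder processes masked optical input, and one lightweight decoder per pretext modality reconstructs a co-registered target (e.g., optical pixels, elevation). The toy encoder, module names, channel counts, and loss weights are all illustrative assumptions, not the paper's ConvNeXt V2 implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiPretextMAE(nn.Module):
    """Sketch of a multi-pretext MAE: one shared encoder on masked optical
    input, one shallow decoder head per pretext modality. Placeholder
    modules only; the paper builds on a ConvNeXt V2 backbone."""

    def __init__(self, embed_dim: int, pretext_channels: dict):
        super().__init__()
        # Toy stand-in for a fully convolutional backbone; any encoder
        # producing a (B, embed_dim, h, w) feature map would fit here.
        self.encoder = nn.Sequential(
            nn.Conv2d(12, embed_dim, kernel_size=4, stride=4),
            nn.GELU(),
            nn.Conv2d(embed_dim, embed_dim, kernel_size=3, padding=1),
        )
        # One 1x1-conv head per target modality, e.g.
        # {"optical": 12, "elevation": 1}; a categorical target such as
        # land cover would get n_classes channels and a cross-entropy loss.
        self.decoders = nn.ModuleDict({
            name: nn.Conv2d(embed_dim, ch, kernel_size=1)
            for name, ch in pretext_channels.items()
        })

    def forward(self, masked_optical):
        feats = self.encoder(masked_optical)      # (B, embed_dim, h, w)
        return {name: head(feats) for name, head in self.decoders.items()}

def multi_pretext_loss(preds, targets, weights):
    """Weighted sum of per-modality reconstruction losses (MSE for the
    continuous targets used in this sketch)."""
    return sum(weights[n] * F.mse_loss(p, targets[n]) for n, p in preds.items())

# Usage: mask a 12-band Sentinel-2-like tile, then reconstruct the optical
# pixels and a co-registered elevation map as multi-modal pretext targets.
model = MultiPretextMAE(embed_dim=64, pretext_channels={"optical": 12, "elevation": 1})
x = torch.randn(2, 12, 64, 64)
mask = (torch.rand(2, 1, 64, 64) > 0.6).float()   # keep roughly 40% of pixels
preds = model(x * mask)
targets = {"optical": torch.randn(2, 12, 16, 16),
           "elevation": torch.randn(2, 1, 16, 16)}
loss = multi_pretext_loss(preds, targets, {"optical": 1.0, "elevation": 0.5})
loss.backward()
```

The key design choice this sketch illustrates is that all pretext modalities share a single encoder, so the gradients from every reconstruction task shape one general-purpose representation of the optical input.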