The field of Earth Observation (EO) offers a wealth of data from diverse sensors, presenting a great opportunity for advancing self-supervised multimodal learning. However, current multimodal EO datasets and models focus on a single data type, either mono-date images or time series, which limits their expressivity. We introduce OmniSat, a novel architecture that exploits the spatial alignment between multiple EO modalities to learn expressive multimodal representations without labels. To demonstrate the advantages of combining modalities of different natures, we augment two existing datasets with new modalities. As demonstrated on three downstream tasks (forestry, land cover classification, and crop mapping), OmniSat can learn rich representations in an unsupervised manner, leading to improved performance in the semi- and fully-supervised settings, even when only one modality is available for inference. The code and dataset are available at https://github.com/gastruc/OmniSat.