Foundation models are transforming Earth Observation (EO), yet the diversity of EO sensors and modalities makes a single universal model unrealistic. Multiple specialized EO foundation models (EOFMs) will likely coexist, making efficient knowledge transfer across modalities essential. Most existing EO pretraining relies on masked image modeling, which emphasizes local reconstruction but offers limited control over global semantic structure. To address this, we propose a dual-teacher contrastive distillation framework for multispectral imagery that aligns the student's pretraining objective with the contrastive self-distillation paradigm of modern optical vision foundation models (VFMs). Our approach combines a multispectral teacher with an optical VFM teacher, enabling coherent cross-modal representation learning. Experiments across diverse optical and multispectral benchmarks show that our model adapts to multispectral data without compromising performance on optical-only inputs, achieving state-of-the-art results in both settings, with average improvements of 3.64 percentage points in semantic segmentation, 1.2 in change detection, and 1.31 in classification. This demonstrates that contrastive distillation provides a principled and efficient approach to scalable representation learning across heterogeneous EO data sources. Code: Coming soon.
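The dual-teacher objective described above can be sketched in PyTorch. The abstract does not specify the architectures, loss form, or loss weighting, so everything below is an assumption for illustration: tiny linear encoders stand in for the student, the multispectral teacher, and the optical VFM teacher; an InfoNCE-style contrastive loss aligns the student's embeddings with each frozen teacher's embeddings; and the two distillation terms are summed with equal weight.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_encoder(in_ch, dim=32, patch=8):
    # Hypothetical toy encoder standing in for the real networks
    # (the paper's backbones are not specified here).
    return nn.Sequential(nn.Flatten(1), nn.Linear(in_ch * patch * patch, dim))

def contrastive_distill_loss(student_z, teacher_z, temperature=0.1):
    """InfoNCE-style loss: the i-th student embedding should match the
    i-th (frozen) teacher embedding; other batch items are negatives."""
    s = F.normalize(student_z, dim=-1)
    t = F.normalize(teacher_z, dim=-1)
    logits = s @ t.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(s.size(0))         # positives on the diagonal
    return F.cross_entropy(logits, targets)

torch.manual_seed(0)
B = 4
ms_img = torch.randn(B, 12, 8, 8)   # 12-band multispectral patch (toy size)
rgb_img = ms_img[:, :3]             # optical view; using the first 3 bands
                                    # as RGB is an assumption

student = make_encoder(12)
ms_teacher = make_encoder(12)       # frozen multispectral teacher
vfm_teacher = make_encoder(3)       # frozen optical VFM teacher
for teacher in (ms_teacher, vfm_teacher):
    for p in teacher.parameters():
        p.requires_grad_(False)

z = student(ms_img)
with torch.no_grad():
    z_ms = ms_teacher(ms_img)
    z_vfm = vfm_teacher(rgb_img)

# Dual-teacher objective: equal-weight sum (the weighting is an assumption).
loss = contrastive_distill_loss(z, z_ms) + contrastive_distill_loss(z, z_vfm)
loss.backward()
```

The key point the sketch captures is that only the student receives gradients: both teachers are frozen, so the multispectral teacher transfers in-domain structure while the optical VFM teacher anchors the student to the optical representation space.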