Clouds remain a major obstacle in optical satellite imaging, limiting accurate environmental and climate analysis. To address the strong spectral variability and the large scale differences among cloud types, we propose MSCloudCAM, a novel multi-scale context adapter network with convolution-based cross-attention, tailored for multispectral and multi-sensor cloud segmentation. A key contribution of MSCloudCAM is the explicit modeling of multiple complementary multi-scale context extractors. Rather than simply stacking or concatenating their outputs, our formulation combines one extractor's fine-resolution features with the other extractor's global contextual representations, enabling dynamic, scale-aware feature selection. Building on this idea, we design a new convolution-based cross-attention adapter that effectively fuses localized, detailed information with broader multi-scale context. Integrated with a hierarchical vision backbone and refined through channel and spatial attention mechanisms, MSCloudCAM achieves strong spectral-spatial discrimination. Experiments on multi-sensor datasets, e.g., CloudSEN12 (Sentinel-2) and L8Biome (Landsat-8), demonstrate that MSCloudCAM achieves superior overall segmentation performance and competitive class-wise accuracy compared to recent state-of-the-art models while keeping model complexity comparable. These results highlight the novelty and effectiveness of the proposed design for large-scale Earth observation.
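The fusion idea described in the abstract, where fine-resolution features from one extractor attend to global context from another, can be illustrated with a minimal single-head cross-attention sketch. This is not the MSCloudCAM implementation: the abstract does not specify the adapter's layers, so the use of 1x1 (pointwise) convolutions for the query, key, and value projections, the residual connection, and all names (`pointwise_conv`, `cross_attention`, `w_q`, `w_k`, `w_v`) are assumptions made for illustration. Feature maps are represented as flattened lists of per-position channel vectors, and the backbone, the context extractors, and the channel and spatial attention stages are omitted.

```python
import math
import random


def pointwise_conv(tokens, weight):
    """1x1 convolution: apply the same linear map at every spatial position."""
    return [[sum(w * x for w, x in zip(row, tok)) for row in weight] for tok in tokens]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(fine, context, w_q, w_k, w_v):
    """Queries come from fine-resolution features; keys and values from context.

    Assumed layout (hypothetical): `fine` and `context` are lists of positions,
    each a list of C channel values. `w_q` and `w_k` are d x C, `w_v` is C x C.
    """
    q = pointwise_conv(fine, w_q)
    k = pointwise_conv(context, w_k)
    v = pointwise_conv(context, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    fused, weights = [], []
    for q_i, x_i in zip(q, fine):
        # Each fine-resolution position weighs every context position.
        attn = softmax([scale * sum(a * b for a, b in zip(q_i, k_j)) for k_j in k])
        mixed = [sum(a * v_j[c] for a, v_j in zip(attn, v)) for c in range(len(x_i))]
        # Residual connection (an assumption) keeps the local detail intact.
        fused.append([x + m for x, m in zip(x_i, mixed)])
        weights.append(attn)
    return fused, weights


def random_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights or feature values."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
channels, dim = 4, 3
fine = random_matrix(16, channels, rng)    # e.g. a 4x4 fine-resolution map, flattened
context = random_matrix(4, channels, rng)  # e.g. a 2x2 coarse context map, flattened
w_q = random_matrix(dim, channels, rng)
w_k = random_matrix(dim, channels, rng)
w_v = random_matrix(channels, channels, rng)

fused, weights = cross_attention(fine, context, w_q, w_k, w_v)
print(len(fused), len(fused[0]), len(weights[0]))
```

The output keeps the spatial size and channel count of the fine-resolution input, and each position's attention weights over the context positions sum to one. That per-position weighting is what makes the selection of context dynamic rather than a fixed concatenation.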