Clouds in satellite imagery pose a significant challenge for downstream applications. Current cloud removal research is held back by the absence of a comprehensive benchmark and of a sufficiently large and diverse training dataset. To address this problem, we introduce $\textit{AllClear}$, the largest public dataset for cloud removal, featuring 23,742 globally distributed regions of interest (ROIs) with diverse land-use patterns and comprising 4 million images in total. Each ROI includes the complete time series of captures from the year 2022, with (1) multi-spectral optical imagery from Sentinel-2 and Landsat 8/9, (2) synthetic aperture radar (SAR) imagery from Sentinel-1, and (3) auxiliary remote sensing products such as cloud masks and land cover maps. We validate the effectiveness of our dataset in three ways: by benchmarking performance, by demonstrating a scaling law (the PSNR rises from $28.47$ to $33.87$ with $30\times$ more data), and by conducting ablation studies on the temporal length and the importance of individual modalities. This dataset aims to provide comprehensive coverage of the Earth's surface and to support better cloud removal results.
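To put the reported PSNR gain in perspective, the sketch below converts it into an equivalent reduction in mean squared error. It assumes the standard definition $\mathrm{PSNR} = 10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$; the abstract does not state the exact evaluation protocol (value range, masking, averaging), so the function names and the default `max_value` here are illustrative assumptions rather than the authors' implementation.

```python
import math


def psnr(mse: float, max_value: float = 1.0) -> float:
    """Standard PSNR in dB for a given mean squared error.

    max_value is the assumed peak signal value (1.0 for images
    normalized to [0, 1]); this is an assumption, not from the paper.
    """
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_from_psnr_gain(gain_db: float) -> float:
    """Factor by which MSE shrinks for a given PSNR gain in dB.

    Follows from PSNR_2 - PSNR_1 = 10 * log10(MSE_1 / MSE_2),
    which holds when both scores use the same peak value.
    """
    return 10.0 ** (gain_db / 10.0)


# PSNR values quoted in the abstract.
gain_db = 33.87 - 28.47
ratio = mse_ratio_from_psnr_gain(gain_db)
print(f"PSNR gain: {gain_db:.2f} dB, MSE reduced by a factor of about {ratio:.2f}")
```

Under this definition, the 5.40 dB improvement corresponds to roughly a 3.5-fold reduction in mean squared error, provided both scores were computed with the same peak value.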