Satellite-based remote sensing has revolutionised the way we address global challenges in a rapidly evolving world. Satellite sensors generate huge quantities of Earth Observation (EO) data daily, but processing these large datasets for use in machine learning (ML) pipelines is technically and computationally challenging. Specifically, different types of EO data are often hosted on different platforms, and the availability of Python preprocessing tools varies between them. In addition, spatial alignment across data sources and data tiling can present significant technical hurdles for novice users. While some preprocessed EO datasets exist, their content is often limited to optical or near-optical wavelength data, which is ineffective at night or in adverse weather conditions. Synthetic Aperture Radar (SAR), an active sensing technique based on microwave-wavelength radiation, offers a viable alternative. However, the application of ML to SAR has been limited by a lack of ML-ready data and pipelines, particularly for the full diversity of SAR data, including polarimetry, coherence and interferometry. We introduce M3LEO, a multi-modal, multi-label EO dataset that includes polarimetric, interferometric and coherence SAR data derived from Sentinel-1, alongside Sentinel-2 RGB imagery and a suite of labelled tasks for model evaluation. M3LEO spans 17.5 TB and contains approximately 10M data chips across six geographic regions. The dataset is complemented by a flexible PyTorch Lightning framework, with configuration management using Hydra. We also provide tools to process any dataset available on popular platforms such as Google Earth Engine for integration with our framework. Initial experiments validate the utility of our data and framework, showing that SAR imagery contains information beyond what can be extracted from RGB data. Data are available at huggingface.co/M3LEO and code at github.com/spaceml-org/M3LEO.
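As a rough illustration of the tiling step mentioned above, the following minimal sketch cuts two rasters that are assumed to be already co-registered on the same pixel grid into matching chips. It is not the M3LEO pipeline: the function name `tile_raster`, the chip size of 256 pixels, the band counts and the random stand-in arrays are all hypothetical choices made for this example.

```python
import numpy as np


def tile_raster(raster: np.ndarray, chip_size: int) -> dict:
    """Split a (bands, height, width) array into non-overlapping square chips.

    Edge pixels that do not fill a whole chip are dropped. Returns a dict
    mapping (row, col) chip indices to views of the input array.
    """
    if raster.ndim != 3:
        raise ValueError("expected an array shaped (bands, height, width)")
    _, height, width = raster.shape
    chips = {}
    for row in range(height // chip_size):
        for col in range(width // chip_size):
            y0, x0 = row * chip_size, col * chip_size
            chips[(row, col)] = raster[:, y0:y0 + chip_size, x0:x0 + chip_size]
    return chips


# Hypothetical stand-ins for two sources on the same 1000 x 1200 pixel grid.
rng = np.random.default_rng(0)
sar = rng.random((2, 1000, 1200))  # e.g. two SAR polarisation bands (assumed)
rgb = rng.random((3, 1000, 1200))  # e.g. three optical bands (assumed)

sar_chips = tile_raster(sar, chip_size=256)
rgb_chips = tile_raster(rgb, chip_size=256)
```

Because both inputs share one grid, the chip at index `(row, col)` covers the same ground footprint in each source, which is what makes the chips usable as paired multi-modal samples. Real data would first need reprojection and resampling onto a common grid, which this sketch deliberately leaves out.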