This work presents SSL4EO-S12 v1.1, a multimodal, multitemporal Earth Observation dataset designed for pretraining large-scale foundation models. Building on the success of SSL4EO-S12, this extension updates the previous version, correcting its geospatial alignment inaccuracies and its inefficient data structure. The dataset allows low-barrier, analysis-ready data loading while maintaining the predecessor's spatial coverage of the world's 10,000 largest cities and surrounding geographies, resulting in 246k time series with nearly one million image patches. We package each time series as a Zarr file stored in WebDataset tar shards, which enables efficient data loading and the representation of meta-information such as cloud masks. We add new modalities for elevation, land cover, and vegetation to support multimodal pretraining. Released under the CC-BY-4.0 license, SSL4EO-S12 v1.1 facilitates open research and provides a robust foundation for future advancements in self-supervised learning and geospatial analysis. The dataset is available online at https://huggingface.co/datasets/embed2scale/SSL4EO-S12-v1.1.
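To make the packaging concrete, the following is a minimal sketch of how a tar shard of zipped Zarr stores could be traversed using only the Python standard library. It relies on the WebDataset convention that members sharing a base name form one sample. Everything else is an assumption for illustration: the member names (`0000001.zarr.zip`), the array names (`bands`, `cloud_mask`), and the helper functions are hypothetical, and the toy shard is built in memory rather than downloaded. It does not describe the actual file layout of the released dataset.

```python
import io
import tarfile
import zipfile
from collections import defaultdict


def make_toy_zip(entries):
    """Build a zip archive standing in for a zipped Zarr store (hypothetical layout)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, mode="w") as zf:
        for name, payload in entries.items():
            zf.writestr(name, payload)
    return buf.getvalue()


def make_toy_shard(members):
    """Build an in-memory tar shard; `members` maps member name -> bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


def read_samples(shard_fileobj):
    """Group tar members by sample key (here simplified to the name up to the first dot)."""
    grouped = defaultdict(dict)
    with tarfile.open(fileobj=shard_fileobj, mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, _, ext = member.name.partition(".")
            grouped[key][ext] = tar.extractfile(member).read()
    return dict(grouped)


# Two toy time series, each a zipped store with image data and a cloud mask.
store = make_toy_zip({"bands/.zarray": b"{}", "cloud_mask/.zarray": b"{}"})
shard = make_toy_shard({"0000001.zarr.zip": store, "0000002.zarr.zip": store})

samples = read_samples(shard)
first_store = zipfile.ZipFile(io.BytesIO(samples["0000001"]["zarr.zip"]))
store_entries = first_store.namelist()
```

In practice one would more likely stream the shards with the `webdataset` package and open each store with `zarr` or `xarray`; the sketch only illustrates why a tar-of-stores layout keeps the imagery and its meta-information together in one sample.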