Deep learning models are increasingly data-hungry, requiring significant resources to collect and compile the datasets needed to train them, and Earth Observation (EO) models are no exception. However, the landscape of EO datasets is relatively fragmented, with interoperability made difficult by diverse formats and data structures. If ever-larger datasets are to be built, and duplication of effort minimised, then a shared framework that allows users to combine and access multiple datasets is needed. Here, Major TOM (Terrestrial Observation Metaset) is proposed as such an extensible framework. It consists primarily of a geographical indexing system based on a set of grid points and a metadata structure that allows multiple datasets from different sources to be merged. Besides the specification of Major TOM as a framework, this work also presents a large, open-access dataset, MajorTOM-Core, which covers the vast majority of the Earth's land surface. This dataset provides the community with an immediately useful resource and also acts as a template for future additions to the Major TOM ecosystem. Access: https://huggingface.co/Major-TOM
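To make the idea of a shared geographical index concrete, the following is a minimal sketch of how records from different sources could be merged once they are keyed by a common grid cell. It is a toy illustration only and not the Major TOM grid specification: the regular latitude/longitude grid, the `cells_per_degree` parameter, the `R{row}_C{col}` naming scheme, the record fields, and the file paths are all assumptions made for this example.

```python
import math


def grid_cell_id(lat: float, lon: float, cells_per_degree: int = 10) -> str:
    """Map a coordinate to a cell of a toy regular lat/lon grid.

    Hypothetical scheme for illustration; the real Major TOM grid may differ.
    """
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError("coordinate out of range")
    row = math.floor(lat * cells_per_degree)
    col = math.floor(lon * cells_per_degree)
    return f"R{row}_C{col}"


def merge_by_cell(*datasets):
    """Combine metadata records from several sources, keyed by grid cell.

    Each dataset is a (source_name, records) pair; each record is a dict
    with hypothetical 'lat', 'lon' and 'path' fields.
    """
    merged = {}
    for source_name, records in datasets:
        for rec in records:
            cell = grid_cell_id(rec["lat"], rec["lon"])
            merged.setdefault(cell, {})[source_name] = rec["path"]
    return merged


# Two hypothetical sources with samples at nearby (not identical) locations.
optical = ("optical", [{"lat": 45.07, "lon": 7.68, "path": "optical/a.tif"}])
radar = ("radar", [
    {"lat": 45.03, "lon": 7.61, "path": "radar/b.tif"},
    {"lat": -33.92, "lon": 18.42, "path": "radar/c.tif"},
])

index = merge_by_cell(optical, radar)
```

Because both sources resolve to the same cell identifier near the first location, their samples can be paired without either dataset needing to know about the other's file layout. That is the kind of interoperability a shared index is meant to enable.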