Deep learning models are increasingly data-hungry, and collecting and compiling the datasets needed to train them requires significant resources; Earth Observation (EO) models are no exception. However, the landscape of EO datasets is fragmented, and diverse formats and data structures make interoperability difficult. If ever-larger datasets are to be built and duplication of effort minimised, a shared framework that lets users combine and access multiple datasets is needed. Here, Major TOM (Terrestrial Observation Metaset) is proposed as such an extensible framework. It consists primarily of a geographical indexing system based on a set of grid points and a metadata structure that allows multiple datasets from different sources to be merged. Besides specifying Major TOM as a framework, this work also presents a large, open-access dataset, MajorTOM-Core, which covers the vast majority of the Earth's land surface. This dataset provides the community with an immediately useful resource and also acts as a template for future additions to the Major TOM ecosystem. Access: https://huggingface.co/Major-TOM
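To make the idea of a grid-point index with mergeable metadata more concrete, the following is a minimal sketch and not the Major TOM specification itself. It assumes a spherical Earth, a hypothetical 10 km spacing, rows equally spaced in latitude, and a per-row column count scaled by the cosine of latitude. The function names `grid_cell` and `merge_by_cell` and the record keys `lat`, `lon`, and `path` are illustrative assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius of a spherical Earth


def grid_cell(lat, lon, spacing_km=10.0):
    """Map a latitude/longitude pair (degrees) to a hypothetical (row, col) cell."""
    # Height of one row in degrees of latitude for the requested spacing.
    deg_per_row = math.degrees(spacing_km / EARTH_RADIUS_KM)
    row = math.floor(lat / deg_per_row)

    # Length of the parallel at the row centre determines how many
    # columns of roughly `spacing_km` width fit into this row.
    centre_lat = min(abs((row + 0.5) * deg_per_row), 90.0)
    parallel_km = 2.0 * math.pi * EARTH_RADIUS_KM * math.cos(math.radians(centre_lat))
    n_cols = max(1, math.floor(parallel_km / spacing_km))

    # Shift longitude to [0, 360] and clamp so lon == 180 stays in the last column.
    col = min(n_cols - 1, math.floor((lon + 180.0) / 360.0 * n_cols))
    return row, col


def merge_by_cell(*sources, spacing_km=10.0):
    """Group records from several named sources under a shared grid cell key.

    Each source is a (name, records) pair; each record is a dict with
    'lat', 'lon' and 'path' keys (an assumed, illustrative schema).
    """
    merged = {}
    for name, records in sources:
        for rec in records:
            cell = grid_cell(rec["lat"], rec["lon"], spacing_km)
            merged.setdefault(cell, {})[name] = rec["path"]
    return merged


if __name__ == "__main__":
    optical = ("optical", [{"lat": 45.0, "lon": 7.0, "path": "optical_tile.tif"}])
    radar = ("radar", [{"lat": 45.0, "lon": 7.0, "path": "radar_tile.tif"}])
    for cell, entries in merge_by_cell(optical, radar).items():
        print(cell, entries)
```

Because both sources are keyed by the same cell identifier, samples covering the same location end up side by side in one entry, which is the property a shared index needs in order to combine independently produced datasets.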