Beyond Standard Datacubes: Extracting Features from Irregular and Branching Earth System Data

Earth science datasets are growing rapidly in both volume and structural complexity. They increasingly contain richly labelled data with heterogeneous metadata and complex internal constraints that impose dependencies between variables and dimensions. Datacubes have become a common abstraction for organising such datasets, but traditional dense and orthogonal datacube models struggle to represent irregular, sparse or branching data spaces efficiently. In this paper, we introduce a generalised data hypercube representation based on compressed tree structures, which enables an accurate and compact description of complex data spaces. We describe the design of this representation and analyse its ability to capture sparsity and conditional relationships while remaining efficient to traverse. Using a concrete implementation, we study the performance characteristics of compressed tree data hypercubes and demonstrate their effectiveness as fast, cache-like indices over large backend data stores. Building on this representation, we present an integrated feature extraction system that operates directly on tree-based data hypercubes within the Polytope framework. By embedding data access strategies into the data hypercube abstraction itself, the system enables precise, sub-field data extraction and supports flexible, user-driven access patterns. We evaluate the performance of the integrated system and show how it enables new ways of interacting with complex datasets that are difficult to support using traditional access models. This work bridges the gap between expressive data hypercube models and efficient data access methods. In particular, it provides a unified framework that combines tree-based data representations with feature extraction capabilities. The proposed approach therefore offers a foundation for scalable and user-centric access to large heterogeneous Earth science datasets.
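
To make the idea of a compressed tree over a branching data space more concrete, the following is a minimal sketch and not the implementation evaluated in the paper. It assumes that each node holds one axis name and a set of values, and that sibling nodes with the same axis and identical subtrees can be merged into a single node. The `Node` class, the helper functions and the axis names (`class`, `levtype`, `param`, `level`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One tree node: an axis name, the values it covers, and its subtrees."""
    axis: str
    values: tuple
    children: tuple = ()


def compress(node):
    """Merge sibling nodes that share an axis and have identical subtrees."""
    kids = [compress(c) for c in node.children]
    merged = {}
    for c in kids:
        merged.setdefault((c.axis, c.children), []).extend(c.values)
    new_children = tuple(
        Node(axis, tuple(dict.fromkeys(vals)), sub)  # de-duplicate, keep order
        for (axis, sub), vals in merged.items()
    )
    return Node(node.axis, node.values, new_children)


def count_nodes(node):
    """Number of nodes stored in the tree."""
    return 1 + sum(count_nodes(c) for c in node.children)


def count_points(node):
    """Number of distinct index tuples (data points) the tree describes."""
    below = sum(count_points(c) for c in node.children) if node.children else 1
    return len(node.values) * below


def contains(node, key):
    """Return True if `key` (a dict of axis -> value) is exactly one point in the tree."""
    if node.axis not in key or key[node.axis] not in node.values:
        return False
    rest = {k: v for k, v in key.items() if k != node.axis}
    if not node.children:
        return not rest  # every axis in the key must have been consumed
    return any(contains(c, rest) for c in node.children)


def levels():
    """Hypothetical pressure levels, one uncompressed node per value."""
    return (Node("level", (500,)), Node("level", (850,)))


# Uncompressed tree with one value per node. The space branches: the "pl"
# parameters depend on a level axis, while the "sfc" parameters have none.
raw = Node("class", ("od",), (
    Node("levtype", ("sfc",), (Node("param", ("2t",)), Node("param", ("10u",)))),
    Node("levtype", ("pl",), (
        Node("param", ("t",), levels()),
        Node("param", ("u",), levels()),
    )),
))

tree = compress(raw)
print(count_nodes(raw), count_nodes(tree))    # 11 6
print(count_points(raw), count_points(tree))  # 6 6
print(contains(tree, {"class": "od", "levtype": "pl", "param": "t", "level": 850}))
print(contains(tree, {"class": "od", "levtype": "sfc", "param": "2t", "level": 850}))
```

In this toy case the compressed tree describes the same six points with roughly half the nodes, and the conditional relationship (a level axis exists only under `levtype=pl`) is kept in the structure itself, so a request for a surface parameter at a pressure level is rejected during traversal without touching any backend data.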
