Data extraction algorithms on data hypercubes, or datacubes, are traditionally only capable of cutting boxes of data along the datacube axes. For many use cases, however, this approach is insufficient: it returns more data than users actually need. This not only forces users to post-process the data after extraction but, more importantly, consumes more I/O resources than necessary. When users want to extract only small non-rectangular subsets from very large datacubes, the box approach does not scale well. Indeed, with this traditional approach, I/O systems quickly reach capacity reading and returning unwanted data. In this paper, we propose a novel technique, based on concepts from computational geometry, which instead carefully pre-selects the precise bytes of data the user needs and then reads only those bytes from the datacube. As we discuss later, this extraction method will considerably help scale access to large, petabyte-scale data hypercubes in a variety of scientific fields.