A Novel Approach to Translate Structural Aggregation Queries to MapReduce Code

Data management applications are growing in number and scale, especially in the "big data" era, so supporting them with novel, efficient algorithms that achieve higher performance is critical. Array database management systems support these applications by handling data represented in n-dimensional data structures. For instance, systems such as SciDB and RasDaMan can deliver the required performance on large-scale problems with multidimensional data. Like their relational counterparts, these systems expose dedicated array query languages as the user interface. MapReduce, a popular programming model, enables large-scale data analysis, facilitates query processing, and is used as a database engine. Nevertheless, one major obstacle is the low productivity of developing MapReduce applications. Unlike queries in high-level declarative languages such as SQL, MapReduce jobs are written as low-level imperative code, which often demands substantial programming effort and complicated debugging. This work presents a system that translates array queries expressed in SciDB's Array Query Language (AQL) into MapReduce jobs. We focus on translating four structural aggregations: circular, grid, hierarchical, and sliding. Unlike traditional aggregations in relational databases, these structural aggregations are designed explicitly for array manipulation. Our work can therefore be considered an array-view counterpart of existing SQL-to-MapReduce translators such as Hive and YSmart. The translator supports structural aggregations over arrays to cover a variety of array manipulations, and it also supports user-defined aggregation functions with minimal user effort. We show that the translator generates optimized MapReduce code that outperforms short handwritten code by up to 10.84x.
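To make the idea of a structural aggregation concrete, the following is a minimal sketch of a grid aggregation written in a map/reduce style. It is an illustration under stated assumptions, not the code the translator emits: the function names (`map_cell`, `reduce_tile`, `grid_aggregate`) are hypothetical, the array is a small in-memory list of lists, and a dictionary stands in for the shuffle phase of a real MapReduce framework.

```python
from collections import defaultdict


def map_cell(coords, value, grid_shape):
    """Map phase (hypothetical helper): key a cell by the grid tile containing it."""
    key = tuple(c // g for c, g in zip(coords, grid_shape))
    return key, value


def reduce_tile(values, aggregate=sum):
    """Reduce phase (hypothetical helper): apply the aggregation to one tile's values."""
    return aggregate(values)


def grid_aggregate(array, grid_shape, aggregate=sum):
    """Simulate map -> shuffle -> reduce for a grid aggregation over a 2-D array."""
    groups = defaultdict(list)  # assumption: stands in for the framework's shuffle
    for i, row in enumerate(array):
        for j, value in enumerate(row):
            key, mapped_value = map_cell((i, j), value, grid_shape)
            groups[key].append(mapped_value)
    return {key: reduce_tile(vals, aggregate) for key, vals in groups.items()}


array = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]

# Sum over non-overlapping 2x2 tiles.
result = grid_aggregate(array, (2, 2))

# A user-defined aggregation is passed in as an ordinary function.
tile_max = grid_aggregate(array, (2, 2), aggregate=max)
```

Because the aggregation function is a parameter, swapping `sum` for `max` or any other function over a list of values changes the result without touching the map or shuffle logic, which is the kind of low-effort customization the abstract describes for user-defined aggregations.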
