The Data Science domain has expanded enormously in both the research and industry communities during the past decade, predominantly owing to the Big Data revolution. Artificial Intelligence (AI) and Machine Learning (ML) are bringing more complexity to data engineering applications, which are now integrated into data processing pipelines that process terabytes of data. Typically, a significant amount of time in these pipelines is spent on data preprocessing, and hence improving its efficiency directly impacts overall pipeline performance. The community has recently embraced dataframes as the de facto data structure for data representation and manipulation. However, the most widely used serial dataframes today (R, pandas) experience performance limitations even on moderately large data sets. We believe there is plenty of room for improvement when this problem is viewed from a high-performance computing perspective. In a prior publication, we presented a set of parallel processing patterns for distributed dataframe operators and the reference runtime implementation, Cylon [1]. In this paper, we expand on the initial concept by introducing a cost model for evaluating those patterns. Furthermore, we evaluate the performance of Cylon on the ORNL Summit supercomputer.