In recent years, dataframe libraries such as pandas have exploded in popularity. Because of their flexibility, they are increasingly used in ad-hoc exploratory data analysis (EDA) workloads. These workloads are diverse and include custom functions that can span libraries or be written in pure Python. Most systems available to accelerate EDA workloads target bulk-parallel workloads, which have vastly different computational patterns and typically stay within a single library. As a result, their expensive optimization techniques can introduce excessive overheads for ad-hoc EDA workloads. Instead, we identify program rewriting as a lightweight technique that can offer substantial speedups while also avoiding slowdowns. We implement our techniques in Dias, which rewrites notebook cells to be more efficient for ad-hoc EDA workloads. We develop techniques for efficient rewrites in Dias, including dynamic checking of the preconditions under which rewrites are correct and just-in-time rewrites for notebook environments. We show that Dias can rewrite individual cells to be 57$\times$ faster than pandas and 1909$\times$ faster than optimized systems such as modin. Furthermore, Dias can accelerate whole notebooks by up to 3.6$\times$ compared to pandas and 26.4$\times$ compared to modin.
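To make the idea of a rewrite guarded by a dynamic precondition check concrete, the following is a minimal sketch in plain Python. It is not Dias's implementation: the helper `guarded_rewrite` and the three example functions are hypothetical names introduced only for illustration, and lists stand in for dataframes. The faster version runs only when a check on the actual runtime value succeeds. Otherwise the original code runs unchanged, so behavior, including the exception raised on empty input, is preserved.

```python
from typing import Any, Callable


def guarded_rewrite(
    precondition: Callable[..., bool],
    fast: Callable[..., Any],
    original: Callable[..., Any],
) -> Callable[..., Any]:
    """Hypothetical helper: run `fast` only if `precondition` holds for the
    actual runtime arguments; otherwise fall back to `original`."""

    def run(*args: Any) -> Any:
        if precondition(*args):
            return fast(*args)
        return original(*args)

    return run


def smallest_by_sorting(xs):
    # Original code: sorts the whole input just to read the first element.
    return sorted(xs)[0]


def smallest_by_min(xs):
    # Candidate rewrite: a single linear pass instead of a full sort.
    return min(xs)


def is_nonempty_int_list(xs):
    # Assumed precondition for this sketch: the rewrite is trusted only for
    # non-empty lists of plain ints. On an empty list, min() raises ValueError
    # while sorted(xs)[0] raises IndexError, so the two are not interchangeable.
    return isinstance(xs, list) and len(xs) > 0 and all(type(x) is int for x in xs)


smallest = guarded_rewrite(is_nonempty_int_list, smallest_by_min, smallest_by_sorting)
```

Calling `smallest([3, 1, 2])` takes the rewritten path, while `smallest((2.5, 1.5))` and `smallest([])` fail the check and run the original code. The cost of the check is paid on every call, which is why a guard like this is only worthwhile when it is much cheaper than the work it avoids.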