Big data programming frameworks have become increasingly important for developing applications in which performance and scalability are critical. In these complex frameworks, optimizing code by hand is hard and time-consuming, which makes automated optimization particularly necessary. A prerequisite for automating optimization is to find suitable abstractions for representing programs, for instance algebras based on monads or monoids that represent distributed data collections. Currently, however, such algebras do not represent recursive programs in a way that allows them to be analyzed or rewritten. In this paper, we extend a monoid algebra with a fixpoint operator that represents recursion as a first-class citizen, and we show how it enables new optimizations. Experiments with the Spark platform illustrate the performance gains brought by these systematic optimizations.
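To make the idea of a fixpoint operator over collections concrete, here is a minimal sketch in plain Python. It is not the operator defined in the paper and does not use Spark: the names `fixpoint`, `transitive_closure`, and `extend` are hypothetical, and the sketch assumes the recursive step distributes over set union, so that each iteration only needs to process the elements found in the previous one (semi-naive evaluation).

```python
from typing import Callable, FrozenSet, Iterable, Tuple, TypeVar

T = TypeVar("T")
Edge = Tuple[int, int]


def fixpoint(
    base: FrozenSet[T],
    step: Callable[[FrozenSet[T]], FrozenSet[T]],
) -> FrozenSet[T]:
    """Smallest set X with X = base | step(X), computed semi-naively.

    Assumption: `step` distributes over union, so applying it to the
    newly discovered elements (the delta) is sufficient.
    """
    result = base
    delta = base
    while delta:
        # Keep only elements not seen before; stop when nothing new appears.
        delta = step(delta) - result
        result = result | delta
    return result


def transitive_closure(edges: Iterable[Edge]) -> FrozenSet[Edge]:
    """Illustrative recursive query: all pairs connected by a path."""
    edge_set = frozenset(edges)

    def extend(paths: FrozenSet[Edge]) -> FrozenSet[Edge]:
        # Join each known path (a, b) with each edge (b, d).
        return frozenset(
            (a, d) for (a, b) in paths for (c, d) in edge_set if b == c
        )

    return fixpoint(edge_set, extend)


closure = transitive_closure({(1, 2), (2, 3), (3, 4)})
```

Because the recursion is an explicit operator rather than an opaque driver loop, a rewriter can in principle inspect `base` and `step` separately, which is the kind of analysis the abstract says existing algebras do not support for recursive programs.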