Pandas is widely used for data science applications, but users often run into problems when datasets are larger than memory. Several frameworks based on lazy evaluation can handle large datasets, but programs must be rewritten to suit each framework, and the presence of multiple frameworks further complicates the programmer's task. In this paper we present a framework that allows programmers to code in plain Pandas: with just two lines of code changed by the user, our system optimizes the program using a combination of just-in-time static analysis and runtime optimization based on a lazy dataframe wrapper framework. Moreover, our system allows the programmer to choose the backend. It works seamlessly with Pandas, Dask, and Modin, so the best-suited backend for an application can be chosen based on factors such as data size. Performance results on a variety of programs show the benefits of our framework.
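To make the idea of a lazy dataframe wrapper concrete, the following is a minimal sketch and not the API of the system described above. The names `LazyFrame`, `read_csv_lazy`, and `compute` are hypothetical, introduced only for illustration. The sketch records operations instead of running them and executes the recorded chain only when a result is requested.

```python
import io

import pandas as pd


class LazyFrame:
    """Hypothetical lazy wrapper: records operations and runs them on compute()."""

    def __init__(self, source, ops=()):
        self._source = source    # zero-argument callable that returns a DataFrame
        self._ops = tuple(ops)   # deferred operations, in the order they were requested

    def _defer(self, fn):
        # Return a new wrapper with one more recorded operation; nothing runs yet.
        return LazyFrame(self._source, self._ops + (fn,))

    def __getitem__(self, columns):
        return self._defer(lambda df: df[columns])

    def query(self, expr):
        return self._defer(lambda df: df.query(expr))

    def compute(self):
        # Only here is the data loaded and the recorded chain applied.
        df = self._source()
        for op in self._ops:
            df = op(df)
        return df


def read_csv_lazy(text):
    # Assumption for this sketch: the data comes from an in-memory CSV string.
    return LazyFrame(lambda: pd.read_csv(io.StringIO(text)))


CSV = "city,fare,tip\nA,10.0,1.0\nB,25.0,5.0\nA,40.0,8.0\n"

lazy = read_csv_lazy(CSV).query("fare > 20")[["city", "tip"]]  # no data read yet
result = lazy.compute()                                        # executes the chain
print(result)
```

Because the whole chain is known before execution, a wrapper of this kind could in principle inspect it first (for example, to load only the columns that are used) or hand the source callable to a different backend. How the system in the paper actually does this is not specified in the abstract.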