Optimizing Pandas programs is a challenging problem. Existing systems and compiler-based approaches offer reliability but are either heavyweight or support only a limited set of optimizations. Conversely, applying LLMs on a per-program basis can synthesize nontrivial optimizations, but is unreliable, expensive, and low-yield. In this work, we introduce RuleFlow, a hybrid three-stage approach that decouples discovery from deployment and connects them via a novel bridge. First, it discovers per-program optimizations (discovery). Second, it converts these optimizations into generalized rewrite rules (bridge). Finally, it incorporates these rules into a compiler that automatically applies them wherever applicable, eliminating repeated reliance on LLMs (deployment). We demonstrate that RuleFlow is the new state-of-the-art (SOTA) Pandas optimization framework on PandasBench, a challenging Pandas benchmark consisting of Python notebooks. Across these notebooks, we achieve speedups of up to 4.3x over Dias, the previous compiler-based SOTA, and up to 1914.9x over Modin, the previous systems-based SOTA. Our code is available at https://github.com/ADAPT-uiuc/RuleFlow.
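To make the notion of a generalized rewrite rule concrete, the following is a minimal sketch of what one such rule could look like when applied at the source level. It is not taken from RuleFlow: the class `RowApplyToVectorized`, the helper `apply_rule`, and the choice of pattern (a row-wise `apply` over two columns replaced by a vectorized column expression) are hypothetical and chosen only for illustration. The rewrite is assumed to be safe only for numeric columns and for the operators `+`, `-`, and `*`.

```python
import ast


class RowApplyToVectorized(ast.NodeTransformer):
    """Hypothetical rule: df.apply(lambda r: r[a] OP r[b], axis=1) -> df[a] OP df[b]."""

    def visit_Call(self, node):
        self.generic_visit(node)
        func = node.func
        # Match `<name>.apply(<lambda>, axis=1)` and nothing more general.
        if not (isinstance(func, ast.Attribute) and func.attr == "apply"
                and isinstance(func.value, ast.Name)
                and len(node.args) == 1
                and isinstance(node.args[0], ast.Lambda)
                and len(node.keywords) == 1
                and node.keywords[0].arg == "axis"
                and isinstance(node.keywords[0].value, ast.Constant)
                and node.keywords[0].value.value == 1):
            return node

        lam = node.args[0]
        params = lam.args
        # The lambda must take exactly one plain parameter (the row).
        if (len(params.args) != 1 or params.posonlyargs or params.kwonlyargs
                or params.vararg or params.kwarg or params.defaults):
            return node
        row = params.args[0].arg

        body = lam.body
        # Assumption: only +, -, * are treated as safe to vectorize here.
        if not (isinstance(body, ast.BinOp)
                and isinstance(body.op, (ast.Add, ast.Sub, ast.Mult))):
            return node

        def is_column_access(expr):
            # True for `row["some_column"]`.
            return (isinstance(expr, ast.Subscript)
                    and isinstance(expr.value, ast.Name)
                    and expr.value.id == row
                    and isinstance(expr.slice, ast.Constant)
                    and isinstance(expr.slice.value, str))

        if not (is_column_access(body.left) and is_column_access(body.right)):
            return node

        def lift(expr):
            # Turn `row["col"]` into `df["col"]`.
            return ast.Subscript(
                value=ast.Name(id=func.value.id, ctx=ast.Load()),
                slice=expr.slice,
                ctx=ast.Load(),
            )

        new_node = ast.BinOp(left=lift(body.left), op=body.op, right=lift(body.right))
        return ast.copy_location(new_node, node)


def apply_rule(source: str) -> str:
    """Apply the rule everywhere it matches and return the rewritten source."""
    tree = RowApplyToVectorized().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9+


if __name__ == "__main__":
    print(apply_rule("df['c'] = df.apply(lambda r: r['a'] + r['b'], axis=1)"))
    # prints: df['c'] = df['a'] + df['b']
```

The sketch matches deliberately narrowly: any call that does not fit the pattern exactly is returned unchanged. A rule applied automatically by a compiler would presumably also need to check preconditions such as column dtypes, which this example does not attempt.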