Multi-Objective Agentic Rewrites for Unstructured Data Processing

One year ago, we open-sourced DocETL, a declarative system for LLM-powered data processing that, as of March 2026, has 3.7K GitHub stars and users across domains (e.g., journalism, law, medicine, policy, finance, and urban planning). In DocETL, users build pipelines by composing operators described in natural language, also known as semantic operators, with an LLM executing each operator's logic. However, when an operator or the data it processes is complex, LLMs often give inaccurate results. To address this challenge, DocETL introduced rewrite directives: abstract rules that guide LLM agents in rewriting pipelines by decomposing operators or data. For example, decomposing a single filter("is this email sent from an executive and discussing fraud?") into the conjunction of two separate semantic filters may improve accuracy. However, DocETL optimizes only for accuracy, not cost. How do we optimize for both? We present MOAR (Multi-Objective Agentic Rewrites), a new optimizer for DocETL. To target cost optimization, we introduce two new categories of directives and add new directives to all three existing categories, bringing the total to over 30 directives, more than double what DocETL originally had. Moreover, since operators can interact with each other unpredictably due to LLM behavior, optimizing operators or sub-pipelines individually can yield suboptimal overall plans. Recognizing this, we design a new global search algorithm that explores rewrites in the context of entire pipelines. Since the space of rewrites is infinite (pipelines can be rewritten in many ways, and each rewritten pipeline can itself be rewritten), our algorithm adapts a multi-armed bandit framework to prioritize which pipelines to rewrite. Across six workloads, MOAR achieves 27% higher accuracy than ABACUS, the next-best optimizer, while matching ABACUS's best accuracy at 55% of its cost.
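
To make the search idea concrete, the following is a minimal sketch, not MOAR's actual algorithm. The abstract says only that the search adapts a multi-armed bandit framework; the sketch substitutes the textbook UCB1 rule and assumes that candidate pipelines are compared by Pareto dominance on (cost, accuracy). The `Candidate` class, its fields, the reward values, and the function names are all hypothetical and are not part of DocETL's API.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate pipeline (hypothetical record, not a DocETL type)."""
    name: str
    cost: float              # assumed: cost per run, lower is better
    accuracy: float          # assumed: score in [0, 1], higher is better
    visits: int = 0          # how often this pipeline was chosen for rewriting
    total_reward: float = 0.0  # assumed: summed reward from its past rewrites


def pareto_frontier(cands):
    """Keep candidates that no other candidate dominates on (cost, accuracy)."""
    frontier = []
    for c in cands:
        dominated = any(
            o.cost <= c.cost
            and o.accuracy >= c.accuracy
            and (o.cost < c.cost or o.accuracy > c.accuracy)
            for o in cands
        )
        if not dominated:
            frontier.append(c)
    return frontier


def ucb1_pick(cands, c=math.sqrt(2)):
    """Pick the next pipeline to rewrite with the standard UCB1 rule.

    Unvisited candidates score +inf, so each is tried at least once.
    """
    total = sum(x.visits for x in cands)

    def score(x):
        if x.visits == 0:
            return math.inf
        mean = x.total_reward / x.visits
        return mean + c * math.sqrt(math.log(total) / x.visits)

    return max(cands, key=score)
```

Under these assumptions, one search step would call `ucb1_pick(pareto_frontier(candidates))`, apply a rewrite directive to the chosen pipeline, evaluate the result, and update `visits` and `total_reward`. Restricting selection to the frontier is one possible design choice; a real optimizer might also keep dominated pipelines in play, since a rewrite of a dominated pipeline can still land on the frontier.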
