DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing

Analyzing unstructured data, such as complex documents, has been a persistent challenge in data processing. Large Language Models (LLMs) have shown promise in this regard, leading to recent proposals for declarative frameworks for LLM-powered unstructured data processing. However, these frameworks focus on reducing the cost of executing user-specified operations with LLMs rather than on improving accuracy, and they execute most operations as-is. This is problematic for complex tasks and data, where LLM outputs for user-defined operations are often inaccurate, even with optimized prompts. We present DocETL, a system that optimizes complex document processing pipelines while accounting for LLM shortcomings. DocETL offers a declarative interface for users to define such pipelines and uses an agent-based framework to automatically optimize them, leveraging novel agent-based rewrites (which we call {\em rewrite directives}) and a new optimization and evaluation framework. Specifically, we introduce {\em (i)} logical rewriting of pipelines, tailored for LLM-based tasks, {\em (ii)} an agent-guided plan evaluation mechanism that synthesizes and orchestrates task-specific validation prompts, and {\em (iii)} an optimization algorithm that efficiently finds promising plans, considering the time constraints of LLM-based plan generation and evaluation. Our evaluation on three different unstructured document analysis tasks demonstrates that DocETL finds plans with outputs that are $1.34\times$ to $4.6\times$ higher quality (e.g., more accurate, more comprehensive) than well-engineered baselines, addressing a critical gap in existing declarative frameworks for unstructured data analysis. DocETL is open-source at \ttt{docetl.org} and, as of October 2024, has amassed over 800 GitHub stars, with users spanning a variety of domains.
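
To make the interplay of rewriting, evaluation, and time-constrained search concrete, the following is a minimal toy sketch and not DocETL's actual interface or algorithm. Every name in it (\ttt{split\_map\_reduce}, \ttt{optimize}, \ttt{toy\_score}, \ttt{budget}) is a hypothetical stand-in: a pipeline is modeled as a list of operation names, a rewrite directive as a function that proposes alternative pipelines, and plan evaluation as a scoring function that, in a real system, would presumably be backed by LLM validation prompts.

```python
from typing import Callable, List

# Assumption: a pipeline is just an ordered list of operation names.
Pipeline = List[str]


def split_map_reduce(pipeline: Pipeline) -> List[Pipeline]:
    """Hypothetical rewrite directive: replace a 'map' with split -> map -> reduce."""
    candidates = []
    for i, op in enumerate(pipeline):
        if op == "map":
            rewritten = pipeline[:i] + ["split", "map", "reduce"] + pipeline[i + 1:]
            candidates.append(rewritten)
    return candidates


def optimize(
    pipeline: Pipeline,
    directives: List[Callable[[Pipeline], List[Pipeline]]],
    score: Callable[[Pipeline], float],
    budget: int,
) -> Pipeline:
    """Score at most `budget` plans (including the original) and return the best."""
    best, best_score = pipeline, score(pipeline)
    evaluated = 1
    for directive in directives:
        for candidate in directive(pipeline):
            if evaluated >= budget:
                return best  # evaluation budget exhausted
            candidate_score = score(candidate)
            evaluated += 1
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
    return best


def toy_score(pipeline: Pipeline) -> float:
    # Stand-in for plan evaluation: pretend that chunked processing is more accurate.
    return 1.0 + (0.5 if "split" in pipeline else 0.0)


plan = optimize(["map", "filter"], [split_map_reduce], toy_score, budget=5)
print(plan)
```

The \ttt{budget} parameter is one simple way to model the time constraints of plan generation and evaluation: each scored plan counts against it, so the search stops early and returns the best plan seen so far.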