Semantic Operators: A Declarative Model for Rich, AI-based Data Processing

The semantic capabilities of large language models (LLMs) have the potential to enable rich analytics and reasoning over vast knowledge corpora. Unfortunately, existing systems either empirically optimize expensive LLM-powered operations with no performance guarantees, or serve a limited set of row-wise LLM operations, which limits their robustness, expressiveness, and usability. We introduce semantic operators, the first formalism for declarative and general-purpose AI-based transformations based on natural language specifications (e.g., filtering, sorting, joining, or aggregating records using natural language criteria). Like a relational operator, each semantic operator opens a rich space of execution plans. Our model specifies the expected behavior of each operator with a high-quality gold algorithm, and we develop an optimization framework that reduces cost while providing accuracy guarantees with respect to that gold algorithm. Using this approach, we propose several novel optimizations that accelerate semantic filtering, joining, group-by, and top-k operations by up to $1,000\times$. We implement semantic operators in the LOTUS system and demonstrate LOTUS' effectiveness on real bulk semantic processing applications, including fact-checking, biomedical multi-label classification, search, and topic analysis. We show that the semantic operator model is expressive: it captures state-of-the-art AI pipelines in a few operator calls and makes it easy to express new pipelines that match or exceed the quality of recent LLM-based analytic systems by up to $170\%$, while offering accuracy guarantees. Overall, LOTUS programs match or exceed the accuracy of state-of-the-art AI pipelines for each task while running up to $3.6\times$ faster than the highest-quality baselines. LOTUS is publicly available at https://github.com/lotus-data/lotus.
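To make the relationship between a gold algorithm and a cheaper execution plan concrete, the following is a minimal sketch of a semantic filter, written under several assumptions. It does not use the LOTUS API: the names `gold_filter`, `cascade_filter`, `oracle`, and `proxy_score` are hypothetical, the "oracle" is a keyword stub standing in for an expensive LLM call, and the thresholds `lo` and `hi` are fixed by hand, whereas an optimizer of the kind described above would presumably choose them so that the accuracy targets hold relative to the gold algorithm.

```python
from typing import Callable, List, Sequence, Tuple


def gold_filter(records: Sequence[str], oracle: Callable[[str], bool]) -> List[str]:
    """Reference plan: ask the expensive oracle about every record."""
    return [r for r in records if oracle(r)]


def cascade_filter(
    records: Sequence[str],
    oracle: Callable[[str], bool],
    proxy_score: Callable[[str], float],
    lo: float,
    hi: float,
) -> Tuple[List[str], int]:
    """Cheaper plan: trust the proxy when it is confident, else call the oracle.

    Returns the kept records and the number of oracle calls made.
    """
    kept: List[str] = []
    oracle_calls = 0
    for r in records:
        s = proxy_score(r)
        if s >= hi:          # proxy is confident the record passes
            kept.append(r)
        elif s <= lo:        # proxy is confident the record fails
            continue
        else:                # uncertain region: fall back to the oracle
            oracle_calls += 1
            if oracle(r):
                kept.append(r)
    return kept, oracle_calls


# Hypothetical stand-ins for an LLM judgment and a cheap scoring model.
def oracle(text: str) -> bool:
    t = text.lower()
    return "vaccine" in t or "immunization" in t


def proxy_score(text: str) -> float:
    t = text.lower()
    if "vaccine" in t:
        return 0.9
    if "health" in t:
        return 0.5
    return 0.1


records = [
    "Vaccine trial shows strong results",
    "Stock markets rallied on Friday",
    "New vaccine guidance for travellers",
    "Health officials discuss immunization schedule",
    "Local team wins championship",
]

gold = gold_filter(records, oracle)
kept, calls = cascade_filter(records, oracle, proxy_score, lo=0.2, hi=0.8)
print(kept == gold, calls, len(records))
```

On this toy input the cascade returns the same records as the gold plan while consulting the oracle for only one of the five records. With hand-picked thresholds this agreement is not guaranteed in general; the sketch only illustrates the shape of the plan space.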
