The semantic capabilities of language models (LMs) have the potential to enable rich analytics and reasoning over vast knowledge corpora. Unfortunately, existing systems lack high-level abstractions to perform bulk semantic queries across large corpora. We introduce semantic operators, a declarative programming interface that extends the relational model with composable AI-based operations for bulk semantic queries (e.g., filtering, sorting, joining or aggregating records using natural language criteria). Each operator can be implemented and optimized in multiple ways, opening a rich space for execution plans similar to relational operators. We implement our operators in LOTUS, an open source query engine with a DataFrame API. Furthermore, we develop several novel optimizations that take advantage of the declarative nature of semantic operators to accelerate semantic filtering, clustering and join operators by up to $400\times$ while offering statistical accuracy guarantees. We demonstrate LOTUS' effectiveness on real AI applications including fact-checking, extreme multi-label classification, and search. We show that the semantic operator model is expressive, capturing state-of-the-art AI pipelines in a few operator calls, and making it easy to express new pipelines that achieve up to $180\%$ higher quality. Overall, LOTUS queries match or exceed the accuracy of state-of-the-art AI pipelines for each task while running up to $28\times$ faster. LOTUS is publicly available at https://github.com/stanford-futuredata/lotus.
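To make the idea of composable semantic operators concrete, the following is a minimal illustrative sketch in plain Python. It is not the LOTUS API: the names `sem_filter`, `sem_join`, `Judge`, and `toy_judge` are hypothetical, the records are invented for illustration, and the language model is replaced by a deterministic substring check so that the example runs without a model. It shows only the declarative shape of the interface, in which a natural-language template is filled in per record (or per pair of records) and a judge decides whether the record is kept.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
# Hypothetical stand-in for an LM call: takes a filled-in prompt, returns a verdict.
Judge = Callable[[str], bool]


def sem_filter(records: List[Record], template: str, judge: Judge) -> List[Record]:
    """Keep the records for which the judge accepts the filled-in template."""
    return [r for r in records if judge(template.format(**r))]


def sem_join(left: List[Record], right: List[Record],
             template: str, judge: Judge) -> List[Record]:
    """Naive nested-loop join: one judge call per (left, right) pair."""
    joined = []
    for l in left:
        for r in right:
            merged = {**l, **r}
            if judge(template.format(**merged)):
                joined.append(merged)
    return joined


def toy_judge(prompt: str) -> bool:
    """Toy judge (assumption): accepts 'text || criterion' when the criterion
    occurs in the text. A real system would call a language model here."""
    text, _, criterion = prompt.partition(" || ")
    return criterion.lower() in text.lower()


# Invented example data.
papers = [
    {"title": "Query optimization in relational engines"},
    {"title": "Protein folding with transformers"},
]
topics = [{"topic": "query"}, {"topic": "protein"}]

# Filter by a fixed criterion, then join papers to topics by a per-pair criterion.
kept = sem_filter(papers, "{title} || query", toy_judge)
joined = sem_join(papers, topics, "{title} || {topic}", toy_judge)
```

Because the caller states only the criterion and not how it is evaluated, an engine is free to replace the naive nested loop in `sem_join` with a cheaper plan. That freedom is the property the abstract's optimizations rely on, although this sketch implements none of them.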