LOTUS: Enabling Semantic Queries with LLMs Over Tables of Unstructured and Structured Data

The semantic capabilities of language models (LMs) have the potential to enable rich analytics and reasoning over vast knowledge corpora. Unfortunately, existing systems lack high-level abstractions for performing semantic queries at scale. We introduce semantic operators, a declarative programming interface that extends the relational model with composable AI-based operations for semantic queries over datasets (e.g., sorting or aggregating records using natural language criteria). Each operator can be implemented and optimized in multiple ways, opening a rich space of execution plans, much as relational operators do. We implement our operators and several optimizations for them in LOTUS, an open-source query engine with a Pandas-like API. We demonstrate the effectiveness of LOTUS across a series of real applications, including fact-checking, extreme multi-label classification, and search. We find that the LOTUS programming model is highly expressive, capturing state-of-the-art query pipelines with low development overhead. Specifically, on the FEVER dataset, LOTUS programs can reproduce FacTool, a recent state-of-the-art fact-checking pipeline, in a few lines of code, and implement a new pipeline that improves accuracy by $9.5\%$ while offering $7$--$34\times$ lower execution time. In the extreme multi-label classification task on the BioDEX dataset, LOTUS reproduces state-of-the-art result quality with its join operator, while providing an efficient algorithm that runs $800\times$ faster than a naive join. In the search and ranking application, LOTUS allows a simple composition of operators to achieve $5.9$--$49.4\%$ higher nDCG@10 than the vanilla retriever and re-ranker, with $1.67$--$10\times$ lower execution time than the LM-based ranking methods used in prior work. LOTUS is publicly available at https://github.com/stanford-futuredata/lotus.
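To make the idea of a semantic operator concrete, the following is a minimal sketch of a filter whose predicate is a natural-language template rather than a boolean expression. It is an illustration only, not the LOTUS implementation or its API: the function `sem_filter`, the `Judge` type, and the toy `keyword_judge` (which stands in for an LM call so the example runs offline) are all hypothetical names chosen here.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
# Hypothetical stand-in for an LM call: receives a fully instantiated
# natural-language question and returns a yes/no decision.
Judge = Callable[[str], bool]


def sem_filter(records: List[Record], template: str, judge: Judge) -> List[Record]:
    """Keep the records for which the judge accepts the instantiated template.

    The template refers to record fields by name, e.g. "{claim}".
    """
    return [r for r in records if judge(template.format(**r))]


def keyword_judge(prompt: str) -> bool:
    # Toy substitute for a language model (assumption for this sketch):
    # answers "yes" only if the prompt mentions a vaccine.
    return "vaccine" in prompt.lower()


claims = [
    {"claim": "The vaccine was approved in 2021."},
    {"claim": "The bridge opened last year."},
]

# The query states *what* to keep in natural language; how each record is
# judged (which model, batching, cheaper proxies) is left to the engine.
kept = sem_filter(claims, "Is the claim '{claim}' about medicine?", keyword_judge)
print(kept)
```

Because the operator takes and returns a plain collection of records, it composes with ordinary relational steps (projection, counting, joins), which is the property the abstract attributes to semantic operators. Swapping `keyword_judge` for a real model call would change the cost and accuracy of the plan but not the query itself.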
