The integration of Large Language Models (LLMs) into data analytics has unlocked powerful capabilities for reasoning over bulk structured and unstructured data. However, existing systems typically rely on either DataFrame primitives, which lack the efficient execution infrastructure of modern DBMSs, or SQL User-Defined Functions (UDFs), which isolate semantic logic from the query optimizer and burden users with implementation complexities. LLM-powered semantic operators also bring new challenges: LLM invocation is costly and non-deterministic, so conventional optimization rules and cost models do not apply to them. To bridge these gaps, we present Sema, a high-performance semantic query engine built on DuckDB that treats LLM-powered semantic operators as first-class citizens. Sema introduces SemaSQL, a declarative dialect that allows users to seamlessly inject natural language expressions into standard SQL clauses, enabling end-to-end optimization and execution. At the logical level, the optimizer of Sema compresses natural language expressions and deduces relational constraints from semantic operators. At runtime, Sema employs Adaptive Query Execution (AQE) to dynamically reorder operators, fuse semantic operations, and apply prompt batching. This approach seeks a Pareto-optimal execution path that balances token consumption and latency under accuracy constraints. We evaluate Sema on 20 semantic queries across classification, summarization, and extraction tasks. Experimental results demonstrate that Sema achieves a $2\times$ to $10\times$ speedup over three baseline systems while maintaining competitive result quality.
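
To make the runtime ideas of operator reordering and prompt batching concrete, the following is a minimal sketch and not Sema's implementation. It assumes each filter carries a per-row cost estimate and a selectivity estimate, orders filters with the classic cost / (1 - selectivity) rank heuristic, and evaluates rows in fixed-size batches so that one call serves several rows. All names here (`Filter`, `rank`, `run_pipeline`, `fake_llm_is_positive`) are hypothetical, and the LLM is replaced by a keyword-matching stub.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


@dataclass
class Filter:
    name: str
    cost_per_row: float  # assumed cost estimate (e.g. tokens or milliseconds)
    selectivity: float   # assumed fraction of rows that pass the filter
    batch_fn: Callable[[List[Row]], List[bool]]  # evaluates a whole batch at once


def rank(f: Filter) -> float:
    # Classic predicate-ordering heuristic: cheap, highly selective filters first.
    return f.cost_per_row / max(1.0 - f.selectivity, 1e-9)


def run_pipeline(
    rows: List[Row], filters: List[Filter], batch_size: int = 4
) -> Tuple[List[Row], Dict[str, int]]:
    calls = {f.name: 0 for f in filters}
    for f in sorted(filters, key=rank):
        kept: List[Row] = []
        # Batching: one invocation of batch_fn handles up to batch_size rows.
        for i in range(0, len(rows), batch_size):
            batch = rows[i:i + batch_size]
            calls[f.name] += 1
            verdicts = f.batch_fn(batch)
            kept.extend(r for r, ok in zip(batch, verdicts) if ok)
        rows = kept
    return rows, calls


def fake_llm_is_positive(batch: List[Row]) -> List[bool]:
    # Stand-in for a single batched LLM prompt; a real system would call a model.
    return ["good" in str(r["review"]) for r in batch]


def year_at_least_2023(batch: List[Row]) -> List[bool]:
    # Ordinary relational predicate, cheap to evaluate.
    return [int(r["year"]) >= 2023 for r in batch]


ROWS: List[Row] = [
    {"id": 0, "year": 2021, "review": "good"},
    {"id": 1, "year": 2022, "review": "bad"},
    {"id": 2, "year": 2023, "review": "good"},
    {"id": 3, "year": 2024, "review": "bad"},
    {"id": 4, "year": 2023, "review": "good"},
    {"id": 5, "year": 2020, "review": "good"},
    {"id": 6, "year": 2024, "review": "good"},
    {"id": 7, "year": 2019, "review": "bad"},
]

# The semantic filter is listed first on purpose; ranking moves it last.
FILTERS = [
    Filter("semantic", cost_per_row=100.0, selectivity=0.5, batch_fn=fake_llm_is_positive),
    Filter("relational", cost_per_row=0.001, selectivity=0.5, batch_fn=year_at_least_2023),
]

result, calls = run_pipeline(ROWS, FILTERS)
print([r["id"] for r in result], calls)
```

On this toy data the relational filter runs first and leaves four rows, so the stubbed semantic filter is invoked once for one batch instead of once per row over all eight rows. An adaptive engine would additionally revise the cost and selectivity estimates during execution, which this static sketch does not attempt.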