Relational databases excel at structured data analysis, but real-world queries increasingly require capabilities beyond standard SQL, such as semantically matching entities across inconsistent names, extracting information not explicitly stored in schemas, and analyzing unstructured text. While text-to-SQL systems enable natural language querying, they remain limited to relational operations and cannot leverage the semantic reasoning capabilities of modern large language models (LLMs). Conversely, recent semantic operator systems extend relational algebra with LLM-powered operations (e.g., semantic joins, mappings, aggregations), but require users to manually construct complex query pipelines. To address this gap, we present SEMA-SQL, a system that automatically answers natural language questions by generating efficient queries that combine relational operations with LLM semantic reasoning. We formalize Hybrid Relational Algebra (HRA), a declarative abstraction unifying traditional relational operators with LLM user-defined functions (UDFs). The system automates three critical aspects: (1) query generation via in-context learning that produces HRA queries with precise natural language specifications for LLM UDFs, (2) query optimization via cost-based transformations and UDF rewriting, and (3) efficient execution algorithms that reduce LLM invocations by an average of 93% in semantic joins through intelligent batching. Extensive experiments with known benchmarks, and extensions thereof, demonstrate the significant query capability improvements possible with our design.
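To make the batching idea behind the semantic join concrete, the following is a minimal sketch and not SEMA-SQL's actual algorithm. It assumes a hypothetical judge function, `stub_llm_match`, that stands in for a single LLM invocation and returns one verdict per candidate pair; the names `semantic_join` and `batch_size` are likewise illustrative. The sketch only shows why grouping candidate pairs into one prompt lowers the number of invocations relative to one call per pair.

```python
from itertools import product
from typing import Callable, List, Tuple

Pair = Tuple[str, str]


def stub_llm_match(pairs: List[Pair]) -> List[bool]:
    """Hypothetical stand-in for ONE LLM call that judges a batch of pairs.

    Assumption: a real system would send the pairs in a single prompt.
    Here two names "match" if they are equal after lowercasing and
    dropping every non-alphanumeric character.
    """
    def norm(s: str) -> str:
        return "".join(ch for ch in s.lower() if ch.isalnum())

    return [norm(a) == norm(b) for a, b in pairs]


def semantic_join(
    left: List[str],
    right: List[str],
    judge: Callable[[List[Pair]], List[bool]],
    batch_size: int,
) -> Tuple[List[Pair], int]:
    """Join two columns with a semantic predicate, batching judge calls.

    Returns the matching pairs and the number of judge invocations.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    candidates = list(product(left, right))  # all candidate pairs
    matches: List[Pair] = []
    calls = 0
    for start in range(0, len(candidates), batch_size):
        batch = candidates[start:start + batch_size]
        verdicts = judge(batch)  # one simulated LLM invocation per batch
        calls += 1
        matches.extend(pair for pair, ok in zip(batch, verdicts) if ok)
    return matches, calls


companies = ["I.B.M.", "Acme Corp", "Globex"]
vendors = ["IBM", "ACME corp.", "Initech", "globex"]

# One call per pair versus six pairs per call.
naive_matches, naive_calls = semantic_join(companies, vendors, stub_llm_match, 1)
batched_matches, batched_calls = semantic_join(companies, vendors, stub_llm_match, 6)
print(naive_calls, batched_calls)
```

With 3 × 4 = 12 candidate pairs, the unbatched run issues 12 simulated calls and the batched run issues 2, while both return the same matches. The savings in a real system depend on prompt size limits and on how accuracy holds up as batches grow, which this sketch does not model.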