Bridge the Last-Mile Gap to Semantic Analytics: Compiling Natural-Language Queries into Semantic Operator Pipelines

Automated AI workflows increasingly rely on natural-language reasoning over heterogeneous data, but lack a practical way to execute it through optimized semantic data systems. Recent semantic operator systems, such as Palimpzest and LOTUS, expose declarative operators for filtering, joining, mapping, and aggregating over tables, text, and images using natural-language predicates. However, these systems require users to manually choose operators, order them, write predicates, and adapt the pipeline to backend-specific APIs. This is difficult for non-experts, brittle across backends, and infeasible for automated workflows where queries and data vary at runtime. We present NL2Pipe, a middleware system that compiles natural-language questions into executable semantic operator pipelines, treating this as a three-phase compilation problem. First, a Query-Data Linker grounds question entities against the actual data and discovers implicit bridge entities needed to connect tables, text, and images. Second, a Semantic Planner produces a backend-agnostic action plan of semantic operators and natural-language predicates. Third, a Code Generator translates the plan into executable code for a target backend using an auto-generated reference document capturing operator signatures, example pipelines, and backend constraints. This separates data-aware reasoning from backend-specific code generation, letting the same planning logic support multiple backends. Evaluation shows NL2Pipe substantially outperforms baselines on complex cross-source workloads (e.g., up to 60% higher F1) while maintaining bounded cost and competitive latency. This demonstrates that automatic compilation from natural language to semantic operator pipelines is both practical and effective for bringing semantic analytics to non-expert users and automated AI workflows.
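The three-phase decomposition can be illustrated with a minimal sketch. Everything here is hypothetical and is not NL2Pipe's actual interface: the names `link_query`, `plan`, `generate_code`, `LinkedQuery`, `PlanStep`, and `REFERENCE`, the toy catalog, and the two backend templates are assumptions for illustration only. In particular, the templates do not reproduce the real Palimpzest or LOTUS APIs. Keyword matching and string templates stand in for the LLM-driven linking, planning, and code generation that the abstract describes.

```python
from dataclasses import dataclass


@dataclass
class LinkedQuery:
    question: str
    grounded: dict[str, str]  # question term -> "source.column"
    bridges: list[str]        # columns shared by more than one source


@dataclass
class PlanStep:
    op: str         # backend-agnostic operator name, e.g. "join" or "filter"
    predicate: str  # natural-language predicate


def link_query(question: str, catalog: dict[str, list[str]]) -> LinkedQuery:
    """Phase 1 (toy): ground question words in the catalog, find bridge columns."""
    words = question.lower().replace("?", "").split()
    grounded: dict[str, str] = {}
    owners: dict[str, list[str]] = {}
    for source, columns in catalog.items():
        for col in columns:
            owners.setdefault(col, []).append(source)
            if col in words:
                grounded[col] = f"{source}.{col}"
    # A column present in several sources can connect them even if the
    # question never mentions it (an implicit bridge).
    bridges = [col for col, srcs in owners.items() if len(srcs) > 1]
    return LinkedQuery(question, grounded, bridges)


def plan(linked: LinkedQuery) -> list[PlanStep]:
    """Phase 2 (toy): emit a backend-agnostic plan of operators and predicates."""
    steps = [PlanStep("join", f"rows agree on {b}") for b in linked.bridges]
    steps.append(PlanStep("filter", linked.question))
    return steps


# Stand-in for the auto-generated reference document: one template per
# operator per backend. Both backends and their syntax are made up.
REFERENCE = {
    "backend_a": {
        "join": "data = sem_join(data, on={predicate!r})",
        "filter": "data = sem_filter(data, {predicate!r})",
    },
    "backend_b": {
        "join": "data = data.join_where({predicate!r})",
        "filter": "data = data.where({predicate!r})",
    },
}


def generate_code(steps: list[PlanStep], backend: str) -> str:
    """Phase 3 (toy): translate the plan using the backend's reference entry."""
    templates = REFERENCE[backend]
    return "\n".join(templates[s.op].format(predicate=s.predicate) for s in steps)


catalog = {
    "movies": ["title", "year", "poster_id"],
    "posters": ["poster_id", "image"],
}
linked = link_query("Which title has a poster image showing a robot?", catalog)
steps = plan(linked)
code_a = generate_code(steps, "backend_a")
code_b = generate_code(steps, "backend_b")
print(code_a)
print(code_b)
```

The same `steps` list feeds both `generate_code` calls, which mirrors the separation of data-aware planning from backend-specific code generation: only the reference entry changes between backends.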
