Semantic query processing engines often support semantic joins, which let users match rows that satisfy conditions specified in natural language. Such join conditions can be evaluated using large language models (LLMs), which solve novel tasks without task-specific training. Currently, many semantic query processing engines implement semantic joins via nested loops, invoking the LLM once per row pair to evaluate the join condition. Instead, this paper proposes a novel algorithm inspired by the block nested loops join operator of traditional database systems. The proposed algorithm integrates batches of rows from both input tables into a single prompt, and each LLM invocation is asked to identify all matching row pairs in that input. The paper introduces formulas for optimizing the size of the row batches, taking into account constraints on the size of the LLM context window (which limits both input and output size). An adaptive variant of the proposed algorithm handles cases in which the size of the output is difficult to estimate. A formal analysis of asymptotic processing costs, together with empirical results, demonstrates that the proposed approach reduces costs significantly and performs well compared to the join implementations used by recent semantic query processing engines.
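To make the batching idea concrete, the following is a minimal Python sketch of a block-style semantic join. It is not the paper's implementation: the prompt layout, the `i,j` reply format, the `llm` callable, and the fixed `left_batch` / `right_batch` parameters are all assumptions made for illustration. In particular, the batch sizes here are supplied by the caller, whereas the paper derives them from formulas that this abstract does not reproduce.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: takes a prompt string, returns the model's text reply.
LLM = Callable[[str], str]


def build_prompt(condition: str, left: Sequence[str], right: Sequence[str]) -> str:
    """Put one batch from each table into a single prompt (assumed layout)."""
    lines = [f"Join condition: {condition}", "Left rows:"]
    lines += [f"L{i}: {row}" for i, row in enumerate(left)]
    lines.append("Right rows:")
    lines += [f"R{j}: {row}" for j, row in enumerate(right)]
    lines.append("List every matching pair as 'i,j', one per line.")
    return "\n".join(lines)


def parse_pairs(reply: str) -> List[Tuple[int, int]]:
    """Extract 'i,j' index pairs from the reply, ignoring malformed lines."""
    pairs = []
    for line in reply.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) == 2 and all(p.isdigit() for p in parts):
            pairs.append((int(parts[0]), int(parts[1])))
    return pairs


def block_semantic_join(
    left: Sequence[str],
    right: Sequence[str],
    condition: str,
    llm: LLM,
    left_batch: int,
    right_batch: int,
) -> Tuple[List[Tuple[int, int]], int]:
    """Return (matching global index pairs, number of LLM invocations)."""
    matches: List[Tuple[int, int]] = []
    calls = 0
    for lo in range(0, len(left), left_batch):
        lblock = left[lo:lo + left_batch]
        for ro in range(0, len(right), right_batch):
            rblock = right[ro:ro + right_batch]
            reply = llm(build_prompt(condition, lblock, rblock))
            calls += 1
            for i, j in parse_pairs(reply):
                # Discard indices that fall outside the current batches.
                if i < len(lblock) and j < len(rblock):
                    matches.append((lo + i, ro + j))
    return matches, calls
```

With batches of size b1 and b2 over tables of n1 and n2 rows, this loop issues ceil(n1/b1) * ceil(n2/b2) invocations, compared with n1 * n2 for a pairwise nested loops join. In practice the batch sizes cannot grow arbitrarily, since both the prompt and the list of returned pairs must fit within the context window.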