Large language models (LLMs) are increasingly used for semantic query processing over large corpora. A set of semantic operators derived from relational algebra has been proposed to provide a unified interface for expressing such queries, among which the semantic filter is a cornerstone. Given a table T and a natural language predicate e, a semantic filter is executed tuple by tuple: for each tuple, it constructs an input prompt that combines the predicate e with the tuple's content, queries the LLM, and obtains a binary decision. This tuple-by-tuple evaluation, however, requires a complete linear scan of the table, incurring prohibitive latency and token costs. Although recent work has attempted to optimize semantic filtering, it still does not break the linear barrier on LLM invocations. To address this, we propose Clustering-Sampling-Voting (CSV), a new framework that reduces LLM invocations to sublinear complexity while providing error guarantees. CSV embeds tuples into semantic clusters, samples a small subset of each cluster for LLM evaluation, and infers cluster-level labels via two proposed voting strategies: UniVote, which aggregates sampled labels uniformly, and SimVote, which weights votes by semantic similarity. Moreover, CSV triggers re-clustering on ambiguous clusters to ensure robustness across diverse datasets. Experiments on real-world datasets demonstrate that CSV reduces the number of LLM calls by 1.28-355x compared to state-of-the-art approaches, while maintaining comparable effectiveness in terms of accuracy and F1 score.
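The cluster-then-sample-then-vote pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `kmeans` is a toy cosine-similarity k-means with farthest-point initialization, `llm_label` is a hypothetical stand-in for the actual LLM predicate call, and the embeddings are assumed to be given as plain vectors. The re-clustering step for ambiguous clusters is omitted for brevity.

```python
import random
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def kmeans(points, k, iters=10):
    """Toy k-means under cosine similarity, farthest-point init."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point least similar to all chosen centroids.
        centroids.append(min(points, key=lambda p: max(cosine(p, c) for c in centroids)))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            best = max(range(k), key=lambda i: cosine(p, centroids[i]))
            clusters[best].append(p)
        new_centroids = []
        for old, cl in zip(centroids, clusters):
            new_centroids.append([sum(d) / len(cl) for d in zip(*cl)] if cl else old)
        centroids = new_centroids
    return clusters, centroids

def csv_filter(points, llm_label, k=2, sample_size=3, strategy="uni", seed=0):
    """CSV sketch: LLM calls are spent only on the sampled tuples;
    each cluster's label is then propagated to all of its members."""
    rng = random.Random(seed)
    clusters, centroids = kmeans(points, k)
    labels = {}
    for cluster, centroid in zip(clusters, centroids):
        sample = rng.sample(cluster, min(sample_size, len(cluster)))
        votes = [(llm_label(p), p) for p in sample]  # only these cost LLM calls
        if strategy == "uni":
            # UniVote: uniform majority over sampled labels.
            decision = Counter(v for v, _ in votes).most_common(1)[0][0]
        else:
            # SimVote: weight each vote by similarity to the cluster centroid.
            weight = Counter()
            for v, p in votes:
                weight[v] += cosine(p, centroid)
            decision = weight.most_common(1)[0][0]
        for p in cluster:
            labels[tuple(p)] = decision
    return labels
```

With n tuples and k clusters, this issues at most k * sample_size LLM calls instead of n, which is the source of the sublinear cost; the error guarantee then depends on how homogeneous the clusters are, which is why ambiguous clusters would be re-clustered in the full framework.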