Large language models (LLMs) are increasingly used for semantic query processing over large corpora. A set of semantic operators derived from relational algebra has been proposed to provide a unified interface for expressing such queries, among which the semantic filter is a cornerstone. Given a table T and a natural language predicate e, a semantic filter is executed tuple by tuple: for each tuple, it constructs an input prompt that combines the predicate e with the tuple's content, queries the LLM, and obtains a binary decision. This tuple-by-tuple evaluation, however, requires a complete linear scan of the table, incurring prohibitive latency and token costs. Although recent work has attempted to optimize semantic filtering, it still does not break the linear barrier on LLM invocations. To address this, we propose Clustering-Sampling-Voting (CSV), a new framework that reduces LLM invocations to sublinear complexity while providing error guarantees. CSV embeds tuples into semantic clusters, samples a small subset of each cluster for LLM evaluation, and infers cluster-level labels via two proposed voting strategies: UniVote, which aggregates sampled labels uniformly, and SimVote, which weights votes by semantic similarity. Moreover, CSV triggers re-clustering on ambiguous clusters to ensure robustness across diverse datasets. Experiments on real-world datasets demonstrate that CSV reduces the number of LLM calls by 1.28-355x compared to state-of-the-art approaches, while maintaining comparable effectiveness in terms of accuracy and F1 score.
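The cluster-then-sample-then-vote pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `kmeans` is a toy cosine-similarity k-means with farthest-point initialization, `llm_label` is a hypothetical stand-in for the actual LLM predicate call, and the embeddings are assumed to be given as plain vectors. The re-clustering step for ambiguous clusters is omitted for brevity.

```python
import random
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def kmeans(points, k, iters=10):
    """Toy k-means under cosine similarity, farthest-point init."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point least similar to all chosen centroids.
        centroids.append(min(points, key=lambda p: max(cosine(p, c) for c in centroids)))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            best = max(range(k), key=lambda i: cosine(p, centroids[i]))
            clusters[best].append(p)
        new_centroids = []
        for old, cl in zip(centroids, clusters):
            new_centroids.append([sum(d) / len(cl) for d in zip(*cl)] if cl else old)
        centroids = new_centroids
    return clusters, centroids

def csv_filter(points, llm_label, k=2, sample_size=3, strategy="uni", seed=0):
    """CSV sketch: LLM calls are spent only on the sampled tuples;
    each cluster's label is then propagated to all of its members."""
    rng = random.Random(seed)
    clusters, centroids = kmeans(points, k)
    labels = {}
    for cluster, centroid in zip(clusters, centroids):
        sample = rng.sample(cluster, min(sample_size, len(cluster)))
        votes = [(llm_label(p), p) for p in sample]  # only these cost LLM calls
        if strategy == "uni":
            # UniVote: uniform majority over sampled labels.
            decision = Counter(v for v, _ in votes).most_common(1)[0][0]
        else:
            # SimVote: weight each vote by similarity to the cluster centroid.
            weight = Counter()
            for v, p in votes:
                weight[v] += cosine(p, centroid)
            decision = weight.most_common(1)[0][0]
        for p in cluster:
            labels[tuple(p)] = decision
    return labels
```

With n tuples and k clusters, this issues at most k * sample_size LLM calls instead of n, which is the source of the sublinear cost; the error guarantee then depends on how homogeneous the clusters are, which is why ambiguous clusters would be re-clustered in the full framework.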