Fast LLM-Based Semantic Filtering: From a Unified Framework to an Adaptive Two-Phase Method

Evaluating a natural-language yes/no predicate over a document corpus under an accuracy target - the semantic filter - is a cornerstone of LLM-based data processing. Calling the LLM (the oracle) on every document is prohibitively expensive, so cascades pair the oracle with a fast proxy. As deployed today, cascades have four limitations. (1) Each cascade family - model-free clustering, prebuilt small-LLM proxies, online-trained proxies - commits to a single representation and pipeline, and wins only in a narrow query regime. (2) The strongest online proxy invests in a custom training scheme for a bi-encoder over dense embeddings, missing the token-level evidence that richer predicates require. (3) The proxy is trained against binary yes/no labels, wasting the LLM's per-document confidence on the boundary documents the proxy most needs to learn. (4) Existing calibrations add a uniform safety margin, conflating genuine proxy uncertainty with small-sample noise and inflating cascade cost. We address these by (1) composing families adaptively - model-free clustering first, online proxy only when needed, with oracle calls shared across phases; (2) replacing the cosine bi-encoder with a hybrid of off-the-shelf token-aware models; (3) training the proxy with the oracle's per-document confidence as a soft label; and (4) a calibration that adds the safety margin only where the labeled sample is sparse. We are also the first to use the oracle's per-document confidence for three purposes: a query-level difficulty compass, a lower bound on the number of oracle calls any proxy-based cascade must make, and the proxy's soft training label. At a 90% accuracy target on three 10K-document corpora, our methods are 1.6-2.0x faster than the best prior method on each corpus and meet the target on 95% of queries; the BER-derived lower bound indicates a further ~4-20x of headroom for future work.
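To make the cascade idea concrete, the following is a minimal sketch, not the method described above. It assumes a toy one-dimensional feature per document, a simulated oracle whose confidence is a logistic function of that feature, a plain logistic-regression proxy trained on the oracle's confidence as a soft label, and fixed routing thresholds in place of any calibration. The clustering phase and the token-aware proxy are omitted, and every name (`oracle_confidence`, `train_soft_proxy`, `cascade_filter`, `low`, `high`) is hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def oracle_confidence(x):
    """Simulated oracle (assumption): P(yes) as a logistic function of the feature."""
    return sigmoid(4.0 * x[0])


def proxy_score(x, w, b):
    """Proxy's estimate of P(yes) for one document."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_soft_proxy(features, soft_labels, epochs=200, lr=0.5):
    """Fit logistic regression by gradient descent on cross-entropy.

    The targets are confidences in [0, 1] rather than hard 0/1 labels;
    the gradient with respect to the logit is still (p - q).
    """
    dim = len(features[0])
    w, b, n = [0.0] * dim, 0.0, len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, q in zip(features, soft_labels):
            err = proxy_score(x, w, b) - q
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def cascade_filter(features, labeled, w, b, low=0.2, high=0.8):
    """Route each document: reuse a stored label, trust the proxy, or call the oracle.

    `labeled` maps document index -> oracle confidence gathered during training,
    so those oracle calls are reused instead of being paid for twice.
    Returns the yes/no decisions and the number of new oracle calls.
    """
    decisions, new_calls = [], 0
    for i, x in enumerate(features):
        if i in labeled:
            decisions.append(labeled[i] >= 0.5)
            continue
        p = proxy_score(x, w, b)
        if p >= high:
            decisions.append(True)
        elif p <= low:
            decisions.append(False)
        else:  # proxy is unsure: defer to the oracle
            new_calls += 1
            decisions.append(oracle_confidence(x) >= 0.5)
    return decisions, new_calls


# Toy corpus: 500 documents with one synthetic feature each.
rng = random.Random(0)
docs = [[rng.uniform(-2.0, 2.0)] for _ in range(500)]

# Label a small sample with the oracle and train the proxy on soft labels.
labeled = {i: oracle_confidence(docs[i]) for i in range(50)}
w, b = train_soft_proxy([docs[i] for i in labeled], list(labeled.values()))

decisions, new_calls = cascade_filter(docs, labeled, w, b)
total_oracle_calls = len(labeled) + new_calls
truth = [oracle_confidence(x) >= 0.5 for x in docs]
accuracy = sum(d == t for d, t in zip(decisions, truth)) / len(docs)
print(f"oracle calls: {total_oracle_calls}/{len(docs)}, agreement with oracle: {accuracy:.3f}")
```

In this sketch the thresholds `low` and `high` are fixed by hand; a calibrated cascade would instead choose them from the labeled sample so that the accuracy target holds with the desired confidence.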
