Fast LLM-Based Semantic Filtering: From a Unified Framework to an Adaptive Two-Phase Method

Evaluating a natural-language yes/no predicate over a document corpus under an accuracy target (a semantic filter) is a cornerstone of LLM-based data processing. Calling the LLM (the oracle) on every document is prohibitively expensive, so cascades pair the oracle with a fast proxy. Cascades as deployed today have four limitations. (1) Each cascade family (model-free clustering, prebuilt small-LLM proxies, online-trained proxies) commits to a single representation and pipeline, and wins only in a narrow query regime. (2) The strongest online proxy invests in a custom training scheme for a bi-encoder over dense embeddings, and so misses the token-level evidence that richer predicates require. (3) The proxy is trained against binary yes/no labels, discarding the LLM's per-document confidence on exactly the boundary documents the proxy most needs to learn. (4) Existing calibrations add a uniform safety margin, conflating genuine proxy uncertainty with small-sample noise and inflating cascade cost. We address these by (1) composing families adaptively: model-free clustering first, an online proxy only when needed, with oracle calls shared across phases; (2) replacing the cosine bi-encoder with a hybrid of off-the-shelf token-aware models; (3) training the proxy with the oracle's per-document confidence as a soft label; and (4) a calibration that adds the safety margin only where the labeled sample is sparse. We are also the first to use the oracle's per-document confidence for three purposes: a query-level difficulty compass, a lower bound on the minimum number of oracle calls any proxy-based cascade can make, and the proxy's soft training label. At a 90% accuracy target on three 10K-document corpora, our methods are 1.6-2.0x faster than the best prior method on each corpus and meet the target on 95% of queries; the BER-derived lower bound indicates a further ~4-20x of headroom for future work.
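To make the soft-label idea in (3) and the proxy/oracle split concrete, the following is a minimal sketch, not the implementation evaluated in this work. It assumes a one-dimensional feature standing in for the proxy's document representation, a plain logistic-regression proxy, and fixed routing thresholds; the names `soft_bce`, `train_proxy`, and `route`, and the values `lo` and `hi`, are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def soft_bce(probs, targets, eps=1e-12):
    """Mean binary cross-entropy against soft targets in [0, 1].

    With targets restricted to {0, 1} this reduces to the usual hard-label loss.
    """
    total = 0.0
    for p, q in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total -= q * math.log(p) + (1.0 - q) * math.log(1.0 - p)
    return total / len(probs)


def train_proxy(xs, targets, lr=0.5, steps=500):
    """Fit a 1-D logistic-regression proxy by gradient descent on soft_bce.

    Assumption: xs is a single scalar feature per document (illustrative only).
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, q in zip(xs, targets):
            g = sigmoid(w * x + b) - q  # d(loss)/d(logit), valid for soft q
            grad_w += g * x
            grad_b += g
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def route(p, lo, hi):
    """Cascade routing: trust the proxy when it is confident, else ask the oracle."""
    if p >= hi:
        return "yes"
    if p <= lo:
        return "no"
    return "oracle"


# Toy data: oracle confidences used directly as soft training labels.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
oracle_conf = [0.02, 0.2, 0.5, 0.8, 0.98]
w, b = train_proxy(xs, oracle_conf)
decisions = [route(sigmoid(w * x + b), lo=0.1, hi=0.9) for x in xs]
```

In this toy setup the boundary document (oracle confidence 0.5) pulls the proxy toward an uncertain score and is routed to the oracle, whereas a hard 0/1 label would force it to one side. The calibration in (4), which decides where `lo` and `hi` sit, is not modeled here.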
