Evaluating a natural-language yes/no predicate over a document corpus under an accuracy target, known as a semantic filter, is a cornerstone of LLM-based data processing. Calling the LLM on every document (the oracle) is prohibitively expensive, so cascades pair the oracle with a fast proxy. Cascades as deployed today have four limitations. (1) Each cascade family (model-free clustering, prebuilt small-LLM proxies, online-trained proxies) commits to a single representation and pipeline, and wins in only a narrow query regime. (2) The strongest online proxy invests in a custom training scheme for a bi-encoder over dense embeddings, missing the token-level evidence that richer predicates require. (3) The proxy is trained against binary yes/no labels, discarding the LLM's per-document confidence on exactly the boundary documents the proxy most needs to learn. (4) Existing calibrations add a uniform safety margin, conflating genuine proxy uncertainty with small-sample noise and inflating cascade cost. We address these by (1) composing families adaptively: model-free clustering first, an online proxy only when needed, with oracle calls shared across phases; (2) replacing the cosine bi-encoder with a hybrid of off-the-shelf token-aware models; (3) training the proxy with the oracle's per-document confidence as a soft label; and (4) introducing a calibration that adds the safety margin only where the labeled sample is sparse. We are also the first to use the oracle's per-document confidence for three purposes: to gauge query-level difficulty (a difficulty compass), to derive a lower bound on the minimum number of oracle calls any proxy-based cascade can make, and to supply the proxy's soft training label. At a 90% accuracy target on three 10K-document corpora, our methods are 1.6-2.0x faster than the best prior method on each corpus and meet the target on 95% of queries; the BER-derived lower bound indicates a further ~4-20x of headroom for future work.
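To make the cascade and soft-label ideas concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes the proxy outputs a probability that the predicate holds and that the oracle returns a confidence in [0, 1]. The function names `soft_bce` and `route`, and the thresholds `lo` and `hi`, are hypothetical and chosen only for illustration.

```python
import math


def soft_bce(p_proxy: float, q_oracle: float, eps: float = 1e-12) -> float:
    """Binary cross-entropy of the proxy's probability against a soft target.

    q_oracle is the oracle's confidence that the predicate holds (assumed to
    lie in [0, 1]); a hard yes/no label is the special case q_oracle in {0, 1}.
    """
    p = min(max(p_proxy, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(q_oracle * math.log(p) + (1.0 - q_oracle) * math.log(1.0 - p))


def route(p_proxy: float, lo: float, hi: float) -> str:
    """Generic cascade routing: trust the proxy when it is confident,
    otherwise defer the document to the oracle (hypothetical thresholds)."""
    if p_proxy >= hi:
        return "yes"
    if p_proxy <= lo:
        return "no"
    return "oracle"


if __name__ == "__main__":
    # A boundary document: the oracle is only 55% confident the answer is yes.
    hard = soft_bce(0.55, 1.0)   # loss if trained against the hard label "yes"
    soft = soft_bce(0.55, 0.55)  # loss if trained against the soft label 0.55
    print(f"hard-label loss={hard:.3f}, soft-label loss={soft:.3f}")
    print(route(0.95, lo=0.2, hi=0.8), route(0.5, lo=0.2, hi=0.8))
```

With a hard label, the loss on a boundary document keeps pushing the proxy toward an extreme probability. With a soft label, the loss is minimized when the proxy matches the oracle's confidence, so the proxy can stay uncertain where the oracle is uncertain and such documents fall between `lo` and `hi` and are deferred.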