Predicates are foundational components in data analysis systems. However, modern workloads increasingly involve unstructured documents, which demand semantic understanding beyond traditional value-based predicates. Faced with enormous document collections and ad-hoc queries, Large Language Models (LLMs) offer powerful zero-shot capabilities, but their high inference cost incurs unacceptable overhead. We therefore introduce \textsc{ScaleDoc}, a novel system that addresses this challenge by decoupling predicate execution into an offline representation phase and an optimized online filtering phase. In the offline phase, \textsc{ScaleDoc} leverages an LLM to generate a semantic representation for each document. Online, for each query, it trains a lightweight proxy model on these representations to filter out the majority of documents, forwarding only the ambiguous cases to the LLM for a final decision. \textsc{ScaleDoc} further introduces two core innovations to achieve significant efficiency: (1) a contrastive-learning framework that trains the proxy model to produce reliable decision scores for predicate evaluation; and (2) an adaptive cascade mechanism that determines an effective filtering policy while meeting a specified accuracy target. Our evaluations across three datasets demonstrate that \textsc{ScaleDoc} achieves over a 2$\times$ end-to-end speedup and reduces expensive LLM invocations by up to 85\%, making large-scale semantic analysis practical and efficient.
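The cascade described above can be sketched in a few lines: a cheap proxy scores every document, confident scores are decided locally, and only the ambiguous band is forwarded to the LLM. This is a minimal illustration, not \textsc{ScaleDoc}'s actual API; the threshold values and the \texttt{proxy\_score}/\texttt{llm\_predicate} callables are assumed placeholders (in the real system, the thresholds would be chosen adaptively to meet the accuracy target).

```python
def cascade_filter(docs, proxy_score, llm_predicate, lo=0.2, hi=0.8):
    """Return (accepted_docs, llm_calls).

    proxy_score(doc) -> float in [0, 1]; llm_predicate(doc) -> bool.
    Documents with scores in the ambiguous band (lo, hi) fall back
    to the expensive LLM; the rest are decided by the proxy alone.
    """
    accepted, llm_calls = [], 0
    for doc in docs:
        s = proxy_score(doc)
        if s >= hi:                    # proxy confident: accept
            accepted.append(doc)
        elif s <= lo:                  # proxy confident: reject
            continue
        else:                          # ambiguous: defer to the LLM
            llm_calls += 1
            if llm_predicate(doc):
                accepted.append(doc)
    return accepted, llm_calls


# Toy usage: score by length as a stand-in proxy, exact check as the "LLM".
docs = ["a", "bb", "ccc", "dddd"]
accepted, calls = cascade_filter(
    docs,
    proxy_score=lambda d: len(d) / 4,
    llm_predicate=lambda d: len(d) >= 3,
)
```

In this toy run, only \texttt{"dddd"} clears the high threshold outright; the other three documents fall in the ambiguous band and cost one LLM call each, which is the quantity the adaptive thresholds aim to minimize.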