Evaluating relevance in large-scale search systems is fundamentally constrained by the governance gap between nuanced, resource-limited human oversight and the high-throughput requirements of production systems. Traditional approaches rely on engagement proxies or sparse manual review, and these methods often fail to capture the full scope of high-impact relevance failures. We present \textbf{SAGE} (Scalable AI Governance \& Evaluation), a framework that operationalizes high-quality human product judgment as a scalable evaluation signal. At the core of SAGE is a bidirectional calibration loop in which natural-language \emph{Policy}, curated \emph{Precedent}, and an \emph{LLM Surrogate Judge} co-evolve. SAGE systematically resolves semantic ambiguities and misalignments, transforming subjective relevance judgment into an executable, multi-dimensional rubric with near human-level agreement. To bridge the gap between frontier model reasoning and industrial-scale inference, we apply teacher-student distillation to transfer high-fidelity judgments into compact student surrogates at \textbf{92$\times$} lower cost. Deployed within LinkedIn Search ecosystems, SAGE guided model iteration through simulation-driven development, distilling policy-aligned models for online serving and enabling rapid offline evaluation. In production, it powered policy oversight that measured ramped model variants and detected regressions invisible to engagement metrics. Collectively, these deployments drove a \textbf{0.25\%} lift in LinkedIn daily active users.