Evaluating relevance in large-scale search systems is fundamentally constrained by the governance gap between nuanced, resource-constrained human oversight and the high-throughput requirements of production systems. While traditional approaches rely on engagement proxies or sparse manual review, these methods often fail to capture the full scope of high-impact relevance failures. We present \textbf{SAGE} (Scalable AI Governance \& Evaluation), a framework that operationalizes high-quality human product judgment as a scalable evaluation signal. At the core of SAGE is a bidirectional calibration loop where natural-language \emph{Policy}, curated \emph{Precedent}, and an \emph{LLM Surrogate Judge} co-evolve. SAGE systematically resolves semantic ambiguities and misalignments, transforming subjective relevance judgment into an executable, multi-dimensional rubric with near human-level agreement. To bridge the gap between frontier model reasoning and industrial-scale inference, we apply teacher-student distillation to transfer high-fidelity judgments into compact student surrogates at \textbf{92$\times$} lower cost. Deployed within LinkedIn Search ecosystems, SAGE guided model iteration through simulation-driven development, distilling policy-aligned models for online serving and enabling rapid offline evaluation. In production, it powered policy oversight that measured ramped model variants and detected regressions invisible to engagement metrics. Collectively, these drove a \textbf{0.25\%} lift in LinkedIn daily active users.