SAGE: Scalable AI Governance & Evaluation

Benjamin Le, Xueying Lu, Nick Stern, Wenqiong Liu, Igor Lapchuk, Xiang Li, Baofen Zheng, Kevin Rosenberg, Jiewen Huang, Zhe Zhang, Abraham Cabangbang, Satej Milind Wagle, Jianqiang Shen, Raghavan Muthuregunathan, Abhinav Gupta, Mathew Teoh, Andrew Kirk, Thomas Kwan, Jingwei Wu, Wenjing Zhang

Evaluating relevance in large-scale search systems is fundamentally constrained by the governance gap between nuanced, resource-constrained human oversight and the high-throughput requirements of production systems. Traditional approaches rely on engagement proxies or sparse manual review, and often fail to capture the full scope of high-impact relevance failures. We present \textbf{SAGE} (Scalable AI Governance \& Evaluation), a framework that operationalizes high-quality human product judgment as a scalable evaluation signal. At the core of SAGE is a bidirectional calibration loop in which natural-language \emph{Policy}, curated \emph{Precedent}, and an \emph{LLM Surrogate Judge} co-evolve. SAGE systematically resolves semantic ambiguities and misalignments, transforming subjective relevance judgment into an executable, multi-dimensional rubric with near human-level agreement. To bridge the gap between frontier model reasoning and industrial-scale inference, we apply teacher-student distillation to transfer high-fidelity judgments into compact student surrogates at \textbf{92$\times$} lower cost. Deployed within LinkedIn Search ecosystems, SAGE guided model iteration through simulation-driven development, distilling policy-aligned models for online serving and enabling rapid offline evaluation. In production, it powered policy oversight that measured ramped model variants and detected regressions invisible to engagement metrics. Collectively, these applications drove a \textbf{0.25\%} lift in LinkedIn daily active users.
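
One step of the calibration loop, comparing the surrogate judge's labels against human-labelled precedent and surfacing the disagreements for review, might be sketched as follows. This is an illustrative sketch only, not SAGE's implementation: the abstract does not name an agreement metric, so Cohen's kappa is swapped in here as a common choice, and the function names, query IDs, and label values are all hypothetical.

```python
from collections import Counter


def cohen_kappa(human, judge):
    """Chance-corrected agreement between two equal-length label sequences.

    Assumption: Cohen's kappa stands in for whatever agreement measure
    the authors actually use; the abstract does not specify one.
    """
    if len(human) != len(judge) or not human:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(human)
    # Fraction of items on which the two raters give the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance from each rater's label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[c] * j_counts[c] for c in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def disagreements(query_ids, human, judge):
    """Items where judge and human differ: candidates for policy or precedent review."""
    return [(q, h, j) for q, h, j in zip(query_ids, human, judge) if h != j]


# Hypothetical precedent set: four queries with binary relevance labels.
query_ids = ["q1", "q2", "q3", "q4"]
human_labels = ["relevant", "relevant", "irrelevant", "irrelevant"]
judge_labels = ["relevant", "irrelevant", "irrelevant", "irrelevant"]

kappa = cohen_kappa(human_labels, judge_labels)
to_review = disagreements(query_ids, human_labels, judge_labels)
```

In a loop of this shape, the items in `to_review` would be the ones examined to decide whether the policy text, the precedent label, or the judge prompt needs revision; how SAGE makes that decision is not described in the abstract.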
