SAGE: Scalable AI Governance & Evaluation

Benjamin Le, Xueying Lu, Nick Stern, Wenqiong Liu, Igor Lapchuk, Xiang Li, Baofen Zheng, Kevin Rosenberg, Jiewen Huang, Zhe Zhang, Abraham Cabangbang, Satej Milind Wagle, Jianqiang Shen, Raghavan Muthuregunathan, Abhinav Gupta, Mathew Teoh, Andrew Kirk, Thomas Kwan, Jingwei Wu, Wenjing Zhang

Evaluating relevance in large-scale search systems is fundamentally constrained by the governance gap between nuanced, resource-constrained human oversight and the high-throughput requirements of production systems. While traditional approaches rely on engagement proxies or sparse manual review, these methods often fail to capture the full scope of high-impact relevance failures. We present SAGE (Scalable AI Governance & Evaluation), a framework that operationalizes high-quality human product judgment as a scalable evaluation signal. At the core of SAGE is a bidirectional calibration loop where natural-language Policy, curated Precedent, and an LLM Surrogate Judge co-evolve. SAGE systematically resolves semantic ambiguities and misalignments, transforming subjective relevance judgment into an executable, multi-dimensional rubric with near human-level agreement. To bridge the gap between frontier model reasoning and industrial-scale inference, we apply teacher-student distillation to transfer high-fidelity judgments into compact student surrogates at 92× lower cost. Deployed within LinkedIn Search ecosystems, SAGE guided model iteration through simulation-driven development, distilling policy-aligned models for online serving and enabling rapid offline evaluation. In production, it powered policy oversight that measured ramped model variants and detected regressions invisible to engagement metrics. Collectively, these drove a 0.25% lift in LinkedIn daily active users.
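
As an illustration of what "agreement" between a surrogate judge and human reviewers could mean on a multi-dimensional rubric, the following is a minimal sketch that computes Cohen's kappa per rubric dimension. The abstract does not state which agreement statistic SAGE uses, so the choice of kappa is an assumption, and the function names, dimension names (`topical_match`, `seniority_fit`), and labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(human, judge):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(human) != len(judge) or not human:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(human)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance from each rater's label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[c] * j_counts[c] for c in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (observed - expected) / (1.0 - expected)


def agreement_by_dimension(human_rows, judge_rows):
    """Kappa for each rubric dimension.

    Each row is a dict mapping a (hypothetical) dimension name to a label.
    """
    dimensions = human_rows[0].keys()
    return {
        d: cohen_kappa([r[d] for r in human_rows], [r[d] for r in judge_rows])
        for d in dimensions
    }


# Hypothetical human-labelled precedents and surrogate-judge outputs.
human_rows = [
    {"topical_match": "rel", "seniority_fit": "good"},
    {"topical_match": "rel", "seniority_fit": "bad"},
    {"topical_match": "irr", "seniority_fit": "good"},
    {"topical_match": "irr", "seniority_fit": "bad"},
]
judge_rows = [
    {"topical_match": "rel", "seniority_fit": "good"},
    {"topical_match": "rel", "seniority_fit": "bad"},
    {"topical_match": "irr", "seniority_fit": "good"},
    {"topical_match": "rel", "seniority_fit": "bad"},
]
scores = agreement_by_dimension(human_rows, judge_rows)
```

In a calibration loop of the kind the abstract describes, a low score on one dimension would be a cue to revisit the policy wording or the precedents for that dimension rather than the judge alone.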

