ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation

Evaluating generative AI models is increasingly resource-intensive due to slow inference, expensive raters, and a rapidly growing landscape of models and benchmarks. We propose ProEval, a proactive evaluation framework that leverages transfer learning to efficiently estimate performance and identify failure cases. ProEval employs pre-trained Gaussian Processes (GPs) as surrogates for the performance score function, mapping model inputs to metrics such as the severity of errors or safety violations. By framing performance estimation as Bayesian quadrature (BQ) and failure discovery as superlevel set sampling, we develop uncertainty-aware decision strategies that actively select or synthesize highly informative inputs for testing. Theoretically, we prove that our pre-trained GP-based BQ estimator is unbiased and bounded. Empirically, extensive experiments on reasoning, safety alignment, and classification benchmarks demonstrate that ProEval is significantly more efficient than competitive baselines. It requires 8-65x fewer samples to achieve estimates within 1% of the ground truth, while simultaneously revealing more diverse failure cases under a stricter evaluation budget.
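The two framings in the abstract, performance estimation as Bayesian quadrature and failure discovery as superlevel set sampling, can be illustrated with a small sketch. This is not the authors' implementation. It assumes a finite pool of one-dimensional inputs, an RBF kernel with a constant prior mean standing in for the pre-trained GP, a greedy maximum-variance rule standing in for the paper's decision strategies, and a synthetic score function. All function names and parameter values below are hypothetical.

```python
import math
import numpy as np

def rbf_kernel(a, b, length_scale=0.15):
    """Squared-exponential kernel between two sets of 1-D inputs."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(K, prior_mean, idx, y, noise=1e-4):
    """GP posterior mean and covariance over the whole pool, given scores y at idx."""
    K_ss = K[np.ix_(idx, idx)] + noise * np.eye(len(idx))
    K_as = K[:, idx]
    # Solve for the mean weights and the covariance correction in one call.
    sol = np.linalg.solve(K_ss, np.column_stack([y - prior_mean[idx], K_as.T]))
    mean = prior_mean + K_as @ sol[:, 0]
    cov = K - K_as @ sol[:, 1:]
    return mean, cov

def bq_estimate(mean, cov):
    """Bayesian quadrature over a uniform finite pool: mean score and its variance."""
    n = len(mean)
    return mean.mean(), cov.sum() / n**2

def exceedance_prob(mean, cov, tau):
    """Posterior probability that the score at each input exceeds tau."""
    sigma = np.sqrt(np.clip(np.diag(cov), 0.0, None)) + 1e-12
    z = (tau - mean) / (sigma * math.sqrt(2.0))
    return 0.5 * np.vectorize(math.erfc)(z)

def run_demo(n_pool=200, budget=12, tau=0.85, seed=0):
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_pool)
    f = 0.5 + 0.4 * np.sin(3.0 * x)       # synthetic score function (assumption)
    K = rbf_kernel(x, x)
    prior_mean = np.full(n_pool, 0.5)     # stand-in for a pre-trained GP prior
    idx = [int(rng.integers(n_pool))]     # first test input chosen at random
    for _ in range(budget - 1):
        _, cov = gp_posterior(K, prior_mean, idx, f[idx])
        var = np.diag(cov).copy()
        var[idx] = -np.inf                # never re-select an evaluated input
        idx.append(int(np.argmax(var)))   # greedy max-variance selection
    mean, cov = gp_posterior(K, prior_mean, idx, f[idx])
    est, est_var = bq_estimate(mean, cov)
    return est, est_var, f.mean(), exceedance_prob(mean, cov, tau)

if __name__ == "__main__":
    est, est_var, truth, probs = run_demo()
    print(f"BQ estimate {est:.4f} (variance {est_var:.2e}), pool average {truth:.4f}")
    print("most likely failure input index:", int(np.argmax(probs)))
```

In this sketch the estimate comes from averaging the GP posterior mean over the pool, so only the `budget` selected inputs are ever scored. Inputs with a high exceedance probability are candidates for the superlevel set, meaning likely failures under the assumed threshold `tau`.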

