Evaluating generative AI models is increasingly resource-intensive due to slow inference, expensive raters, and a rapidly growing landscape of models and benchmarks. We propose ProEval, a proactive evaluation framework that leverages transfer learning to efficiently estimate performance and identify failure cases. ProEval employs pre-trained Gaussian Processes (GPs) as surrogates for the performance score function, mapping model inputs to metrics such as the severity of errors or safety violations. By framing performance estimation as Bayesian quadrature (BQ) and failure discovery as superlevel set sampling, we develop uncertainty-aware decision strategies that actively select or synthesize highly informative inputs for testing. Theoretically, we prove that our pre-trained GP-based BQ estimator is unbiased and bounded. Empirically, extensive experiments on reasoning, safety alignment, and classification benchmarks demonstrate that ProEval is significantly more efficient than competitive baselines. It requires 8-65x fewer samples to achieve estimates within 1% of the ground truth, while simultaneously revealing more diverse failure cases under a stricter evaluation budget.
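To make the Bayesian quadrature and superlevel-set framing concrete, the following is a minimal sketch over a finite pool of test inputs. It is not the authors' implementation. It assumes a zero-mean GP with an RBF kernel and fixed hyperparameters as a stand-in for the pre-trained GP, and a greedy rule that queries the input expected to shrink the variance of the mean-score estimate the most. All function names (`rbf_kernel`, `gp_posterior`, `bq_estimate`, `next_query`, `active_bq`, `failure_probability`) and parameter values are hypothetical.

```python
from math import erf

import numpy as np


def rbf_kernel(a, b, length_scale=0.2, variance=1.0):
    """Squared-exponential kernel between the rows of a and b."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return variance * np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)


def gp_posterior(K, obs_idx, y_obs, prior_mean=0.0, noise=1e-4):
    """Posterior mean and covariance over the whole pool, given scores at obs_idx."""
    n = K.shape[0]
    if len(obs_idx) == 0:
        return np.full(n, prior_mean), K.copy()
    idx = np.asarray(obs_idx, dtype=int)
    K_oo = K[np.ix_(idx, idx)] + noise * np.eye(len(idx))
    K_po = K[:, idx]
    # Solve once for both the residual vector and the cross-covariance columns.
    rhs = np.column_stack([np.asarray(y_obs, dtype=float) - prior_mean, K_po.T])
    sol = np.linalg.solve(K_oo, rhs)
    mean = prior_mean + K_po @ sol[:, 0]
    cov = K - K_po @ sol[:, 1:]
    return mean, cov


def bq_estimate(mean, cov):
    """BQ estimate of the pool-average score and its posterior variance."""
    n = len(mean)
    return float(mean.mean()), float(max(cov.sum(), 0.0) / n**2)


def next_query(cov, obs_idx, noise=1e-4):
    """Pick the unobserved input whose score most reduces the estimate's variance."""
    gain = cov.sum(axis=0) ** 2 / (np.diag(cov) + noise)
    if len(obs_idx) > 0:
        gain[np.asarray(obs_idx, dtype=int)] = -np.inf
    return int(np.argmax(gain))


def active_bq(X, score_fn, budget, prior_mean=0.0, noise=1e-4):
    """Query score_fn on `budget` actively chosen inputs, then estimate the mean score."""
    K = rbf_kernel(X, X)
    obs_idx, y_obs = [], []
    for _ in range(budget):
        _, cov = gp_posterior(K, obs_idx, y_obs, prior_mean, noise)
        j = next_query(cov, obs_idx, noise)
        obs_idx.append(j)
        y_obs.append(score_fn(X[j]))
    mean, cov = gp_posterior(K, obs_idx, y_obs, prior_mean, noise)
    return bq_estimate(mean, cov), obs_idx, mean, cov


def failure_probability(mean, cov, threshold):
    """Posterior probability that each input's score exceeds `threshold`."""
    sd = np.sqrt(np.maximum(np.diag(cov), 1e-12))
    z = (threshold - mean) / (np.sqrt(2.0) * sd)
    return 0.5 * (1.0 - np.vectorize(erf)(z))
```

In this sketch, `active_bq(X, score_fn, budget)` returns the estimated average score together with its posterior variance, and `failure_probability` ranks inputs by how likely their score is to lie in the superlevel set above a chosen threshold. A real deployment would replace the fixed kernel with one learned from prior evaluation data, which is where the transfer-learning component described above would enter.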