Query-efficient model evaluation using cached responses

Evaluating a new model on an existing benchmark is often necessary to understand its behavior before deployment. With modern evaluation frameworks, generating and scoring a response for every query can be prohibitively expensive. In practice, responses from previously evaluated models are often cached, which creates an opportunity to use this additional information to reduce the number of queries required to accurately evaluate a new model. In this paper, we introduce an approach for predicting benchmark performance from cached model responses. The approach is based on the Data Kernel Perspective Space (DKPS), a method for quantifying the relationship between models in the black-box setting. Theoretically, we show that DKPS-based methods are query-efficient under certain conditions. Empirically, we demonstrate that DKPS-based methods achieve the same mean absolute error as baselines with a substantially smaller query budget. We conclude by proposing an offline method for selecting a set of queries that maximizes the goodness-of-fit on reference models, improving prediction accuracy over random query selection.
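The abstract does not spell out the prediction pipeline, so the following is only a minimal sketch of one way such a method could work, under stated assumptions: each response has already been mapped to a fixed-length embedding vector, the distance between two models is the scaled Euclidean distance between their stacked response embeddings, models are placed in a low-dimensional space by classical multidimensional scaling, and a linear regression maps that space to benchmark scores. The names `classical_mds` and `predict_score`, the array shapes, and the choice of regression are hypothetical and not taken from the paper.

```python
import numpy as np

def classical_mds(D, k):
    """Embed n points with pairwise distance matrix D (n x n) into k dimensions."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)               # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:k]             # indices of the k largest
    return vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0.0, None))

def predict_score(ref_emb, ref_scores, new_emb, k=2):
    """Predict the benchmark score of a new model (illustrative assumption).

    ref_emb:    (n_ref, m, d) cached response embeddings of reference models
                on the m queries that were also sent to the new model
    ref_scores: (n_ref,) known benchmark scores of the reference models
    new_emb:    (m, d) response embeddings of the new model on the same queries
    """
    all_emb = np.concatenate([ref_emb, new_emb[None]], axis=0)
    flat = all_emb.reshape(all_emb.shape[0], -1)
    # Pairwise model-to-model distances, scaled by the number of queries.
    diff = flat[:, None, :] - flat[None, :, :]
    D = np.linalg.norm(diff, axis=-1) / np.sqrt(all_emb.shape[1])
    Z = classical_mds(D, k)                      # one row per model
    A = np.hstack([Z, np.ones((Z.shape[0], 1))]) # add an intercept column
    # Fit on reference models only, then evaluate at the new model's position.
    coef, *_ = np.linalg.lstsq(A[:-1], ref_scores, rcond=None)
    return float(A[-1] @ coef)
```

In this sketch only the `m` selected queries are run on the new model; everything else comes from the cache, which is where the query savings described above would come from.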
