Evaluating a new model on an existing benchmark is often necessary to understand its behavior before deployment. With modern evaluation frameworks, generating and evaluating a response for every query can be prohibitively expensive. In practice, responses from previously evaluated models are often cached, which creates an opportunity to use this additional information to reduce the number of queries required to accurately evaluate a new model. In this paper, we introduce an approach for predicting benchmark performance from cached model responses. The approach is based on the Data Kernel Perspective Space (DKPS), a method for quantifying the relationship between models in the black-box setting. Theoretically, we show that DKPS-based methods are query-efficient under certain conditions. Empirically, we demonstrate that DKPS-based methods match the mean absolute error of baselines with a substantially smaller query budget. We conclude by proposing an offline method for selecting a set of queries that maximizes the goodness-of-fit on reference models, improving prediction accuracy over random query selection.
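As a rough illustration of the idea, the following minimal sketch predicts a new model's benchmark score from cached reference responses. It is not the paper's implementation: it assumes each response has already been mapped to a fixed-length embedding vector, measures the distance between two models as a scaled Frobenius norm of the difference of their embedding matrices, places the models in a low-dimensional space with classical multidimensional scaling, and fits an ordinary least-squares regression on the reference models. The function names `classical_mds` and `predict_score`, the choice of distance, and the linear regressor are all assumptions made for illustration.

```python
import numpy as np


def classical_mds(dist, n_components=1):
    """Embed points from a pairwise distance matrix via classical MDS."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-center the squared distances to obtain a Gram matrix.
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Clip tiny negative eigenvalues caused by floating-point error.
    scales = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scales


def predict_score(ref_embeddings, ref_scores, new_embeddings, n_components=1):
    """Predict the new model's score (hypothetical helper).

    ref_embeddings: (n_ref, n_queries, dim) cached response embeddings
    ref_scores:     (n_ref,) known benchmark scores of the reference models
    new_embeddings: (n_queries, dim) embeddings of the new model's responses
    """
    stacked = np.concatenate([ref_embeddings, new_embeddings[None]], axis=0)
    n_models, n_queries = stacked.shape[0], stacked.shape[1]
    flat = stacked.reshape(n_models, -1)
    # Assumed model-to-model distance: scaled Frobenius norm of the difference.
    diffs = flat[:, None, :] - flat[None, :, :]
    dist = np.linalg.norm(diffs, axis=-1) / np.sqrt(n_queries)
    # Reference models and the new model are embedded jointly.
    coords = classical_mds(dist, n_components)
    design = np.column_stack([np.ones(n_models), coords])
    # Fit on the reference models only, then evaluate at the new model (last row).
    weights, *_ = np.linalg.lstsq(design[:-1], ref_scores, rcond=None)
    return float(design[-1] @ weights)


# Synthetic example: models lie along one direction and score is linear in position.
rng = np.random.default_rng(0)
direction = rng.normal(size=(8, 4))            # 8 queries, 4-dim embeddings
positions = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ref_embeddings = positions[:, None, None] * direction
ref_scores = 0.1 + 0.05 * positions
new_embeddings = 2.5 * direction               # new model, answered on the same 8 queries
predicted = predict_score(ref_embeddings, ref_scores, new_embeddings)
```

In this toy setup the new model only needs responses on the small query subset, while the reference models' scores on the full benchmark come from the cache. How well the prediction transfers to real models depends on the embedding, the distance, and the regressor, none of which are fixed by the abstract.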