Efficient benchmarking techniques aim to lower the computational cost of evaluating LLMs by predicting full benchmark scores using only a subset of a benchmark's questions. By reframing this problem as an instance of multiple regression with feature selection, we find that existing efficient benchmarking methods can be greatly improved by simply using kernel ridge regression at the prediction stage. Additionally, using an information-theoretic feature-selection algorithm called minimum redundancy maximum relevance (mRMR), we can further improve upon these methods by selecting question subsets that will be maximally useful for prediction. Except in very data-poor settings, these approaches consistently achieve smaller prediction errors (in both MAE and RMSE) and greater ranking correlation between predicted and true scores (in both Spearman $ρ$ and Kendall $τ$) across a range of benchmarks using both binary and continuous metrics. Furthermore, mRMR subsampling is much faster than competing methods (which often involve fitting probabilistic models or running clustering algorithms), and is more likely to select the same questions under different random seeds or training data splits. Tutorial code can be found at https://github.com/sambowyer/mrmr_eval.
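The two-stage idea described above (select a small question subset with mRMR, then predict the full benchmark score with kernel ridge regression) can be illustrated with a minimal sketch. This is not the authors' implementation, which lives in the linked repository. The synthetic data and the helper names `mutual_info`, `mrmr_select`, and `krr_fit_predict` are all assumptions made for illustration, as are the quartile binning of the target, the RBF kernel, and the values of `gamma` and `lam`.

```python
import numpy as np

def mutual_info(a, b):
    """Mutual information (in nats) between two non-negative integer arrays."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1.0)          # contingency table of co-occurrences
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)  # marginal of a, shape (A, 1)
    pb = joint.sum(axis=0, keepdims=True)  # marginal of b, shape (1, B)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])).sum())

def mrmr_select(X, y_binned, k):
    """Greedy mRMR: pick the question maximising relevance minus mean redundancy."""
    n_q = X.shape[1]
    relevance = np.array([mutual_info(X[:, j], y_binned) for j in range(n_q)])
    selected = [int(np.argmax(relevance))]
    redundancy = np.zeros(n_q)  # running sum of MI with already-selected questions
    while len(selected) < k:
        last = selected[-1]
        redundancy += np.array(
            [mutual_info(X[:, j], X[:, last]) for j in range(n_q)]
        )
        score = relevance - redundancy / len(selected)
        score[selected] = -np.inf  # never re-select a question
        selected.append(int(np.argmax(score)))
    return selected

def krr_fit_predict(X_tr, y_tr, X_te, gamma, lam):
    """Kernel ridge regression with an RBF kernel, solved in closed form."""
    def rbf(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    y_mean = y_tr.mean()  # centre the target so the kernel models deviations
    K = rbf(X_tr, X_tr)
    alpha = np.linalg.solve(K + lam * np.eye(len(X_tr)), y_tr - y_mean)
    return rbf(X_te, X_tr) @ alpha + y_mean

# Synthetic benchmark (assumption): 120 models x 200 binary-scored questions,
# generated from a simple logistic ability/difficulty model.
rng = np.random.default_rng(0)
ability = rng.normal(size=(120, 1))
difficulty = rng.normal(size=(1, 200))
X = (rng.random((120, 200)) < 1 / (1 + np.exp(-(ability - difficulty)))).astype(int)
y = X.mean(axis=1)  # full benchmark score per model

X_tr, X_te, y_tr, y_te = X[:80], X[80:], y[:80], y[80:]

# Discretise the training target into quartile bins so MI can be estimated.
y_binned = np.digitize(y_tr, np.quantile(y_tr, [0.25, 0.5, 0.75]))

k = 20
selected = mrmr_select(X_tr, y_binned, k)
preds = krr_fit_predict(
    X_tr[:, selected].astype(float), y_tr,
    X_te[:, selected].astype(float), gamma=1.0 / k, lam=0.1,
)

mae = float(np.abs(preds - y_te).mean())
baseline_mae = float(np.abs(y_tr.mean() - y_te).mean())  # predict the train mean
print(f"MAE with {k} mRMR-selected questions: {mae:.4f}")
print(f"MAE of mean-score baseline:          {baseline_mae:.4f}")
```

In this sketch the selection step only sees the training models, and the regression is fitted on those same models restricted to the selected questions. Evaluating a new model then requires running it on just `k` questions rather than the full benchmark.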