Efficient benchmarking techniques aim to lower the computational cost of evaluating LLMs by predicting full benchmark scores from only a subset of a benchmark's questions. By reframing this problem as an instance of multiple regression with feature selection, we find that existing efficient benchmarking methods can be greatly improved simply by using kernel ridge regression at the prediction stage. Additionally, using an information-theoretic feature-selection algorithm called minimum redundancy maximum relevance (mRMR), we can further improve upon these methods by selecting question subsets that are maximally useful for prediction. Except in very data-poor settings, these approaches consistently achieve smaller prediction errors (in both MAE and RMSE) and higher rank correlation between predicted and true scores (in both Spearman $\rho$ and Kendall $\tau$) across a range of benchmarks, using both binary and continuous metrics. Furthermore, mRMR subsampling is much faster than competing methods (which often involve fitting probabilistic models or running clustering algorithms), and is more likely to select the same questions under different random seeds or training data splits. Tutorial code can be found at https://github.com/sambowyer/mrmr_eval.
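The two-stage pipeline described in the abstract (select a question subset, then regress full scores on it) can be illustrated with a minimal sketch. This is not the authors' implementation, which is available at the repository linked above. Several pieces are assumptions made for illustration: the data are synthetic, the function names `mrmr_select` and `krr_fit_predict` are hypothetical, absolute Pearson correlation stands in for the information-theoretic relevance and redundancy terms of mRMR, and the RBF kernel and the values of `lam` and `gamma` are arbitrary choices.

```python
import numpy as np

def mrmr_select(X, y, k):
    """Greedy mRMR-style selection of k question indices.

    Assumption: |Pearson correlation| is used as a cheap proxy for the
    mutual-information terms of mRMR.
    """
    n, d = X.shape
    Xs = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    ys = (y - y.mean()) / (y.std() + 1e-12)
    relevance = np.abs(Xs.T @ ys) / n           # |corr(question, full score)|
    selected = [int(np.argmax(relevance))]
    redundancy_sum = np.zeros(d)
    while len(selected) < k:
        last = selected[-1]
        # Accumulate |corr| between every question and the newest pick.
        redundancy_sum += np.abs(Xs.T @ Xs[:, last]) / n
        score = relevance - redundancy_sum / len(selected)
        score[selected] = -np.inf               # never pick a question twice
        selected.append(int(np.argmax(score)))
    return selected

def rbf_kernel(A, B, gamma):
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def krr_fit_predict(X_train, y_train, X_test, lam=0.1, gamma=None):
    """Kernel ridge regression in closed form: alpha = (K + lam*I)^{-1} y."""
    if gamma is None:
        gamma = 1.0 / X_train.shape[1]          # assumed default bandwidth
    y_mean = y_train.mean()
    K = rbf_kernel(X_train, X_train, gamma)
    alpha = np.linalg.solve(K + lam * np.eye(len(K)), y_train - y_mean)
    return rbf_kernel(X_test, X_train, gamma) @ alpha + y_mean

# Synthetic stand-in for a benchmark: rows are models, columns are questions,
# entries are binary correctness drawn from a simple logistic model.
rng = np.random.default_rng(0)
n_models, n_questions, k = 200, 300, 20
ability = rng.normal(size=n_models)
difficulty = rng.normal(size=n_questions)
p_correct = 1.0 / (1.0 + np.exp(-(ability[:, None] - difficulty[None, :])))
X = (rng.random((n_models, n_questions)) < p_correct).astype(float)
y = X.mean(axis=1)                              # full-benchmark score per model

train, test = np.arange(150), np.arange(150, n_models)
selected = mrmr_select(X[train], y[train], k)   # choose questions on train only
pred = krr_fit_predict(X[train][:, selected], y[train], X[test][:, selected])

mae = np.abs(pred - y[test]).mean()
baseline_mae = np.abs(y[train].mean() - y[test]).mean()  # predict the train mean
print(f"MAE with {k} questions: {mae:.4f} (mean baseline: {baseline_mae:.4f})")
```

Selecting questions on the training models only, then evaluating on held-out models, mirrors the setting in which a new model is scored on the subset and its full score is predicted.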