The prohibitive cost of evaluating large language models (LLMs) on comprehensive benchmarks necessitates the creation of small yet representative data subsets (i.e., tiny benchmarks) that enable efficient assessment while retaining predictive fidelity. Current methods for this task operate under a model-centric paradigm, selecting benchmark items based on the collective performance of existing models. Such approaches are limited by large upfront costs, an inability to immediately handle new benchmarks (`cold-start'), and the fragile assumption that future models will share the failure patterns of their predecessors. In this work, we challenge this paradigm and propose an item-centric approach to benchmark subset selection, arguing that selection should be based on the intrinsic properties of the task items themselves rather than on model-specific failure patterns. We instantiate this item-centric approach to efficient benchmarking via a novel method, Scales++, in which data selection is based on the cognitive demands of the benchmark samples. Empirically, we show that Scales++ reduces the upfront selection cost by over 18x while achieving competitive predictive fidelity. On the Open LLM Leaderboard, using just a 0.5\% data subset, we predict full benchmark scores with a 2.9\% mean absolute error. We demonstrate that this item-centric approach enables more efficient model evaluation without significant fidelity degradation, while also providing better cold-start performance and more interpretable benchmarking.