The prohibitive cost of evaluating large language models (LLMs) on comprehensive benchmarks necessitates the creation of small yet representative data subsets (i.e., tiny benchmarks) that enable efficient assessment while retaining predictive fidelity. Current methods for this task operate under a model-centric paradigm, selecting benchmark items based on the collective performance of existing models. Such approaches are limited by large upfront costs, an inability to immediately handle new benchmarks (`cold-start'), and the fragile assumption that future models will share the failure patterns of their predecessors. In this work, we challenge this paradigm and propose an item-centric approach to benchmark subset selection, arguing that selection should be based on the intrinsic properties of the task items themselves rather than on model-specific failure patterns. We instantiate this item-centric approach to efficient benchmarking via a novel method, Scales++, in which data selection is based on the cognitive demands of the benchmark samples. Empirically, we show that Scales++ reduces the upfront selection cost by over 18x while achieving competitive predictive fidelity. On the Open LLM Leaderboard, using just a 0.5\% data subset, we predict full benchmark scores with a 2.9\% mean absolute error. We demonstrate that this item-centric approach enables more efficient model evaluation without significant fidelity degradation, while also providing better cold-start performance and more interpretable benchmarking.