The Flaw of Averages: Quantifying Uniformity of Performance on Benchmarks

Benchmarks shape scientific conclusions about model capabilities and steer model development. This creates a feedback loop: stronger benchmarks drive better models, and better models demand more discriminative benchmarks. Ensuring benchmark reliability is therefore essential for trustworthy evaluation and meaningful progress. In this work, we study benchmark reliability from a distributional perspective and introduce benchmark harmony, which measures how uniformly a model's performance is distributed across the subdomains of a benchmark. We posit that high harmony is a desirable benchmark property, indicating that the aggregate metric reflects uniform competence across subdomains. Across 19 multiple-choice benchmarks and five model families, we map each benchmark onto a mean-variance plane of harmony computed across models, where high mean and low variance signal more reliable evaluation. Our analysis shows that less harmonious benchmarks can give misleading results, since overall accuracy may be disproportionately influenced by specific subdomains. For instance, ARC-Easy is overwhelmed by questions on Biological Concepts, overshadowing other critical subdomains such as Geography, Physics, Chemistry, and Environmental Science. By recommending that harmony should be reported alongside accuracy, we reframe evaluation from simple performance averages to a more robust, distributionally reliable measurement of performance.
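To make the central concern concrete, the following is a minimal sketch of how an aggregate accuracy can hide uneven subdomain performance. The abstract does not give the formula for harmony, so `uniformity_score` below is a stand-in proxy (normalized Shannon entropy of per-subdomain accuracies) and should not be read as the authors' definition. The subdomain names, sizes, and accuracies are hypothetical values chosen for illustration only.

```python
import math
from statistics import mean, pvariance


def uniformity_score(subdomain_accuracy):
    """Illustrative proxy only (NOT the paper's harmony definition).

    Normalized Shannon entropy of per-subdomain accuracies:
    1.0 when every subdomain has the same accuracy, approaching 0.0
    when a single subdomain accounts for all of the accuracy mass.
    """
    values = list(subdomain_accuracy.values())
    k = len(values)
    total = sum(values)
    if k < 2 or total <= 0:
        raise ValueError("need at least two subdomains and a positive accuracy sum")
    probs = [v / total for v in values]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(k)


def aggregate_accuracy(subdomain_accuracy, subdomain_size):
    """Overall accuracy: per-subdomain accuracy weighted by question count."""
    n = sum(subdomain_size[d] for d in subdomain_accuracy)
    return sum(subdomain_accuracy[d] * subdomain_size[d] for d in subdomain_accuracy) / n


def mean_variance_point(scores):
    """Place a benchmark on a mean-variance plane from per-model scores."""
    return mean(scores), pvariance(scores)


# Hypothetical benchmark in which one subdomain dominates the question pool.
sizes = {"biology": 700, "physics": 100, "chemistry": 100, "geography": 100}

# Hypothetical models with the same overall accuracy but different profiles.
model_a = {"biology": 0.90, "physics": 0.30, "chemistry": 0.30, "geography": 0.30}
model_b = {"biology": 0.72, "physics": 0.72, "chemistry": 0.72, "geography": 0.72}

acc_a = aggregate_accuracy(model_a, sizes)
acc_b = aggregate_accuracy(model_b, sizes)
score_a = uniformity_score(model_a)
score_b = uniformity_score(model_b)
benchmark_point = mean_variance_point([score_a, score_b])

print(f"model A: accuracy={acc_a:.3f}, uniformity={score_a:.3f}")
print(f"model B: accuracy={acc_b:.3f}, uniformity={score_b:.3f}")
print(f"benchmark (mean, variance) = {benchmark_point}")
```

Under these assumed numbers, both models reach the same aggregate accuracy, yet model A owes most of it to the oversized subdomain. Reporting a uniformity-style score alongside accuracy would expose that difference, which is the kind of distinction the abstract argues for.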

