Benchmarks shape scientific conclusions about model capabilities and steer model development. This creates a feedback loop: stronger benchmarks drive better models, and better models demand more discriminative benchmarks. Ensuring benchmark reliability is therefore essential for trustworthy evaluation and meaningful progress. In this work, we study benchmark reliability from a distributional perspective and introduce benchmark harmony, which measures how uniformly a model's performance is distributed across a benchmark's subdomains. We posit that high harmony is a desirable benchmark property: it indicates that the aggregate metric reflects uniform competence across subdomains. Across 19 multiple-choice benchmarks and five model families, we map each benchmark onto a mean-variance plane of harmony computed across models, where a high mean and low variance signal more reliable evaluation. Our analysis shows that less harmonious benchmarks can yield misleading results, since overall accuracy may be disproportionately influenced by a few subdomains. For instance, ARC-Easy is dominated by questions on Biological Concepts, which overshadow other critical subdomains such as Geography, Physics, Chemistry, and Environmental Science. By recommending that harmony be reported alongside accuracy, we reframe evaluation from simple performance averages toward a more robust, distributionally reliable measurement of performance.
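The abstract does not spell out the exact formula for harmony. The sketch below is one plausible instantiation, not the paper's method: it assumes harmony is the normalized Shannon entropy of a model's per-subdomain accuracy vector (1.0 when performance is perfectly uniform), with the mean and variance then taken across models to place a benchmark on the mean-variance plane. The function names (`harmony`, `mean_variance_plane`) and the toy numbers are hypothetical.

```python
import numpy as np

def harmony(subdomain_accuracies):
    """Normalized Shannon entropy of per-subdomain accuracy.

    Assumed formulation: normalize the accuracy vector to a probability
    distribution, take its entropy, and divide by log(K) so that 1.0
    means perfectly uniform performance across K subdomains.
    """
    acc = np.asarray(subdomain_accuracies, dtype=float)
    if len(acc) < 2 or acc.sum() == 0:
        return 1.0 if acc.sum() > 0 else 0.0
    p = acc / acc.sum()
    nz = p[p > 0]  # mask zeros so 0 * log(0) contributes nothing
    entropy = -(nz * np.log(nz)).sum()
    return float(entropy / np.log(len(acc)))

def mean_variance_plane(per_model_accuracies):
    """Map one benchmark to (mean, variance) of harmony across models.

    per_model_accuracies: dict of model name -> per-subdomain accuracies
    for that model on this benchmark.
    """
    scores = [harmony(a) for a in per_model_accuracies.values()]
    return float(np.mean(scores)), float(np.var(scores))

# Toy usage: a benchmark dominated by one subdomain yields lower harmony.
balanced = {"model_a": [0.80, 0.78, 0.82], "model_b": [0.70, 0.72, 0.69]}
skewed   = {"model_a": [0.95, 0.40, 0.35], "model_b": [0.90, 0.50, 0.30]}
print(mean_variance_plane(balanced))  # high mean, low variance: more reliable
print(mean_variance_plane(skewed))    # lower mean: accuracy driven by one subdomain
```

Entropy is only one candidate uniformity measure; a complement of the Gini coefficient or a variance-based dispersion score would serve the same illustrative purpose under this assumed reading.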