Recent advancements in Language Models (LMs) have catalyzed the creation of multiple benchmarks designed to assess these models' general capabilities. A crucial task, however, is assessing the validity of the benchmarks themselves. This is most commonly done via Benchmark Agreement Testing (BAT), where new benchmarks are validated against established ones using some agreement metric (e.g., rank correlation). Despite the crucial role of BAT for benchmark builders and consumers, there are no standardized procedures for such agreement testing. This deficiency can lead to invalid conclusions, fostering mistrust in benchmarks and hindering the ability to choose the appropriate benchmark for a given use case. By analyzing over 40 prominent benchmarks, we demonstrate how some overlooked methodological choices can significantly influence BAT results, potentially undermining the validity of conclusions. To address these inconsistencies, we propose a set of best practices for BAT and demonstrate how adopting these methodologies greatly improves BAT robustness and validity. To foster adoption and facilitate future research, we introduce BenchBench, a Python package for BAT, and release the BenchBench-leaderboard, a meta-benchmark designed to evaluate benchmarks using their peers. Our findings underscore the necessity of standardized BAT, ensuring the robustness and validity of benchmark evaluations in the evolving landscape of language model research. BenchBench Package: github.com/IBM/BenchBench Leaderboard: hf.co/spaces/IBM/BenchBench
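As a minimal illustration of the agreement metric mentioned above, the sketch below rank-correlates the model scores assigned by two benchmarks using Spearman's rho. The per-model scores are hypothetical placeholders, not results from any real benchmark, and this is a bare-bones sketch of the general idea rather than the BenchBench package's actual API.

```python
def rank(values):
    """Return 1-based ascending ranks (assumes no ties, for simplicity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman(x, y):
    """Spearman rank correlation via the classic sum-of-squared-rank-differences formula."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical per-model scores from a new benchmark and an established one.
new_bench = {"model_a": 71.2, "model_b": 65.4, "model_c": 80.1, "model_d": 58.9}
ref_bench = {"model_a": 68.0, "model_b": 60.2, "model_c": 77.5, "model_d": 61.0}

models = sorted(new_bench)  # shared model set, fixed order
rho = spearman([new_bench[m] for m in models], [ref_bench[m] for m in models])
print(f"Spearman rho: {rho:.2f}")  # Spearman rho: 0.80
```

In practice one would use a library routine (e.g., `scipy.stats.spearmanr` or `kendalltau`, which also handle ties and report p-values); the paper's point is that choices such as which reference benchmarks, which models, and which metric to use can materially change this number.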