Code-related benchmarks play a critical role in evaluating large language models (LLMs), and their quality fundamentally shapes how the community interprets model capabilities. Although awareness of benchmark quality has grown in recent years, our decade-spanning (2014-2025) survey of 572 code benchmarks reveals a lag between that awareness and actual practice. For example, in 2025 alone, the number of benchmarks that ignore code coverage when providing test cases nearly matches the total accumulated over the previous ten years. In response, we take a clear position: code benchmarks must prioritize rigor in construction, reliability in evaluation, and reproducibility in release. To operationalize this position, we introduce HOW2BENCH, a code-benchmark guideline comprising 55 checklist items. Finally, our human study shows that the current issues stem not only from the significant effort benchmark construction requires, but also from a lack of awareness of its importance.