Code2Bench: Scaling Source and Rigor for Dynamic Benchmark Construction

The evaluation of code-generating Large Language Models (LLMs) is fundamentally constrained by two intertwined challenges: a reliance on static, easily contaminated problem sources and the use of superficial, low-rigor testing. This paper introduces a new benchmark construction philosophy, Dual Scaling, designed to systematically address both limitations. Our approach involves continuously scaling the source of problems from dynamic, real-world code repositories and systematically scaling the rigor of tests via automated, high-coverage Property-Based Testing (PBT). We instantiate this philosophy in CODE2BENCH, an end-to-end framework that leverages Scope Graph analysis for principled dependency classification and a 100% branch coverage quality gate to ensure test suite integrity. Using this framework, we construct CODE2BENCH-2509, a new benchmark suite with native instances in both Python and Java. Our extensive evaluation of 10 state-of-the-art LLMs on CODE2BENCH-2509, powered by a novel "diagnostic fingerprint" visualization, yields three key insights: (1) models exhibit a fundamental performance gap, excelling at API application (Weakly Self-Contained tasks) but struggling with algorithmic synthesis (Self-Contained tasks); (2) a model's performance is profoundly shaped by the target language's ecosystem, a nuance we are the first to systematically quantify; and (3) our rigorous, scaled testing is critical in uncovering an "illusion of correctness" prevalent in simpler benchmarks. Our work presents a robust, scalable, and diagnostic paradigm for the next generation of LLM evaluation in software engineering. The code, data, and results are available at https://code2bench.github.io/.
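To make the "illusion of correctness" concrete, the following minimal sketch compares a candidate solution against a ground-truth function on many randomly generated inputs, in the spirit of Property-Based Testing. It is not the CODE2BENCH implementation: the function names (`reference_dedupe`, `buggy_dedupe`, `find_counterexample`) and the input generator are hypothetical, the standard-library `random` module stands in for a dedicated PBT library, and no branch coverage is measured.

```python
import random


def reference_dedupe(items):
    """Hypothetical ground-truth function: drop duplicates, keep first-occurrence order."""
    seen = set()
    out = []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out


def buggy_dedupe(items):
    """Plausible but wrong candidate: removes duplicates but sorts the result."""
    return sorted(set(items))


def random_int_list(rng, max_len=8, lo=-5, hi=5):
    """Generate a short list of small integers so that duplicates are likely."""
    return [rng.randint(lo, hi) for _ in range(rng.randint(0, max_len))]


def find_counterexample(candidate, reference, trials=500, seed=0):
    """Return the first generated input on which candidate and reference disagree, else None."""
    rng = random.Random(seed)
    for _ in range(trials):
        args = random_int_list(rng)
        # Pass copies so neither function can mutate the other's input.
        if candidate(list(args)) != reference(list(args)):
            return args
    return None


if __name__ == "__main__":
    # A single hand-written check passes, which is the kind of weak test that hides the bug.
    print(buggy_dedupe([1, 1, 2]) == reference_dedupe([1, 1, 2]))
    # Randomized checking against the reference exposes a disagreeing input.
    print(find_counterexample(buggy_dedupe, reference_dedupe) is not None)
```

The hand-written case `[1, 1, 2]` happens to be already sorted, so the buggy candidate passes it. Randomly generated inputs reach unsorted lists and expose the defect.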
