The evaluation of Large Language Models (LLMs) for software engineering has shifted towards complex, repository-level tasks. However, existing benchmarks predominantly rely on coarse-grained pass rates that treat programming proficiency as a monolithic capability, obscuring specific cognitive bottlenecks. Furthermore, the static nature of these benchmarks renders them vulnerable to data contamination and performance saturation. To address these limitations, we introduce CoreCodeBench, a configurable repository-level benchmark designed to dissect coding capabilities through atomized tasks. Leveraging our automated framework, CorePipe, we extract and transform Python repositories into a comprehensive suite of tasks that isolate distinct cognitive demands within identical code contexts. Unlike static evaluations, CoreCodeBench supports controllable difficulty scaling to prevent saturation and maintains high data quality, achieving a 78.55% validity yield that substantially exceeds the 31.7% retention rate of SWE-bench-Verified. Extensive experiments with state-of-the-art LLMs reveal a significant capability misalignment, evidenced by distinct ranking shifts across cognitive dimensions. This indicates that coding proficiency is non-monolithic, as strength in one aspect does not necessarily translate to others. These findings underscore the necessity of our fine-grained taxonomy in diagnosing model deficiencies and offer a sustainable, rigorous evaluation framework for evolving code intelligence. The code for CorePipe is available at https://github.com/AGI-Eval-Official/CoreCodeBench, and the data for CoreCodeBench can be accessed at https://huggingface.co/collections/tubehhh/corecodebench-68256d2faabf4b1610a08caa.