LLMs have achieved strong results on both function-level code synthesis and repository-level code modification, yet a capability that falls between these two extremes -- compositional code creation, i.e., building a complete, internally structured class from a specification -- remains underserved. Current evaluations are either confined to isolated functions or rely on manually curated class-level tasks that are expensive to scale and increasingly susceptible to data contamination. We introduce ClassEval-Pro, a benchmark of 300 class-level tasks spanning 11 domains, constructed through an automated three-stage pipeline that combines complexity enhancement, cross-domain class composition, and integration of real-world GitHub code contributed after January 2025. Every task is validated by an LLM Judge Ensemble and must pass test suites with over 90% line coverage. We evaluate five frontier LLMs under five generation strategies. The best model achieves only 45.6% class-level Pass@1, with a 17.7-point gap between the strongest and weakest models, confirming the benchmark's discriminative power. Strategy choice strongly interacts with model capability: structured approaches such as bottom-up improve weaker models by up to 9.4 percentage points, while compositional generation collapses to as low as 1.3%. Error analysis over 500 manually annotated failures reveals that logic errors (56.2%) and dependency errors (38.0%) dominate, identifying cross-method coordination as the core bottleneck.
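The abstract reports class-level Pass@1 without stating how it is computed. As a minimal sketch, assuming the standard unbiased pass@k estimator from the code-generation literature, 1 - C(n-c, k) / C(n, k), and assuming a generated class counts as correct only if it passes the task's entire test suite, the metric could be computed as below. The function names and the `results` layout are hypothetical and are not taken from ClassEval-Pro.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: number of sampled generations, c: number that passed, k: budget.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def class_level_pass_at_1(results: Sequence[Sequence[bool]]) -> float:
    """Average pass@1 over tasks.

    results[i][j] is True if the j-th generated class for task i passed
    every test in that task's suite (assumed definition of class-level
    correctness; hypothetical data layout).
    """
    if not results:
        raise ValueError("no tasks given")
    scores = [pass_at_k(len(r), sum(r), 1) for r in results]
    return sum(scores) / len(scores)
```

With one sample per task, this reduces to the fraction of tasks whose single generated class passes all tests.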