Code generation models are typically compared using compact execution benchmarks and aggregate pass rates, but such summaries obscure how performance varies across programming languages, problem families, and failure modes. We present a large-scale, execution-grounded evaluation of 9 openly accessible LLMs specialized for coding on 2,707 free LeetCode problems across 12 programming languages. Our corpus contains 325,343 problem-model-language jobs, each linked to prompt metadata, extracted code, LeetCode execution outcomes, and static-analysis signals. The results show that current open models remain far from the human acceptance reference: the best model, Yi-Coder-9B-Chat, reaches 23.64% mean correctness, compared with a 57.2% human acceptance baseline. Rankings are also slice-dependent: Qwen2.5-Coder-14B-Instruct is strongest on hard problems and distinct-problem coverage, while Gemma-2-27B-IT achieves the highest all-language lint pass rate. Failure analysis shows that compile errors account for 63.25% of non-accepted best submissions, indicating that many failures occur before semantic correctness can be tested. Static quality also diverges from functional correctness: the model with the highest lint pass rate is not the model with the highest mean correctness. Together, these findings show that multilingual, artifact-preserving evaluation reveals tradeoffs hidden by single-language or single-metric leaderboards.