This paper introduces CodeGolf Bench, a benchmark for evaluating the ability of Large Language Models (LLMs) to generate concise code in 60 programming languages. Based on code golf, a recreational programming competition in which solutions are ranked by minimal character or byte count, the benchmark provides a distinctive measure of how well LLMs produce efficient, concise code. Unlike existing benchmarks, which are limited by fixed problem sets and narrow language coverage, CodeGolf Bench leverages the code.golf platform to provide new problems and live human performance baselines. An evaluation of nine LLMs on Python and C++ tasks shows that reasoning models significantly outperform non-reasoning models, with a best average percentile of 70.97%. The gap is particularly pronounced in C++, highlighting the importance of reasoning for languages with strict syntax requirements. Non-reasoning models struggle more with efficiency optimization in both languages, and their best percentiles are significantly lower than those of their reasoning counterparts. CodeGolf Bench offers a dynamic framework for evaluating LLM code generation capabilities against evolving human performance on code golf.
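As an illustration of how a percentile against human baselines might be computed, the following is a minimal sketch, not the benchmark's actual scoring code. It assumes that a solution's percentile is the share of human solutions that are strictly longer in UTF-8 bytes. The function names `byte_length` and `percentile_vs_humans` and the baseline sizes are hypothetical.

```python
def byte_length(source: str) -> int:
    """Size of a solution in UTF-8 bytes (code golf may score bytes or characters)."""
    return len(source.encode("utf-8"))


def percentile_vs_humans(solution_bytes: int, human_bytes: list[int]) -> float:
    """Percentage of human solutions strictly longer than the given one.

    Assumption: ties do not count as beaten. The benchmark may define
    its percentile differently.
    """
    if not human_bytes:
        raise ValueError("need at least one human baseline")
    beaten = sum(1 for h in human_bytes if h > solution_bytes)
    return 100.0 * beaten / len(human_bytes)


if __name__ == "__main__":
    # Made-up human solution sizes for a single problem, for illustration only.
    humans = [40, 42, 45, 50, 61, 75, 90, 120]
    candidate = "print(*range(1,11))"
    size = byte_length(candidate)
    print(size, percentile_vs_humans(size, humans))
```

Under this assumed definition, a higher percentile means the solution is shorter than a larger share of human submissions. Averaging the per-problem values would give a figure comparable in spirit to the average percentile reported above.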