Large language models (LLMs) have demonstrated a strong ability to generate code for productive work. However, current code synthesis benchmarks, such as HumanEval, MBPP, and DS-1000, are predominantly oriented towards introductory tasks in algorithms and data science, and do not adequately capture the challenging requirements prevalent in real-world coding. To fill this gap, we propose NaturalCodeBench (NCB), a challenging code benchmark designed to mirror the complexity and variety of scenarios in real coding tasks. NCB comprises 402 high-quality problems in Python and Java, carefully selected from natural user queries submitted to online coding services and covering 6 different domains. Because creating test cases for real-world queries is extraordinarily difficult, we also introduce a semi-automated pipeline that makes test case construction more efficient: compared with fully manual construction, it improves efficiency by more than 4 times. Our systematic experiments on 39 LLMs show that models with close HumanEval scores can still differ significantly in performance on NCB, indicating either a lack of focus on practical code synthesis scenarios or overly specific optimization for HumanEval. Moreover, even the best-performing model, GPT-4, is still far from satisfactory on NCB. The evaluation toolkit and development set are available at https://github.com/THUDM/NaturalCodeBench.