To facilitate evaluation of code generation systems across diverse scenarios, we present CodeBenchGen, a framework for creating scalable execution-based benchmarks that requires only light human guidance. Specifically, we leverage a large language model (LLM) to convert an arbitrary piece of code into an evaluation example, including test cases for execution-based evaluation. We illustrate the usefulness of our framework by creating a dataset, Exec-CSN, which includes 1,931 examples involving 293 libraries, revised from code in 367 GitHub repositories taken from the CodeSearchNet dataset. To assess the complexity and solvability of the examples in Exec-CSN, we present a human study showing that 81.3% of the examples can be solved by humans and 61% are rated as "requires effort to solve". We conduct code generation experiments on open-source and proprietary models and analyze the performance of both humans and models. We provide the code at https://github.com/Veronicium/CodeBenchGen.