The ability of CodeLLMs to generate executable and functionally correct code at repository scale remains largely unexplored. We introduce RepoExec, a novel benchmark for evaluating repository-level code generation. RepoExec focuses on three main aspects: executability, functional correctness verified through automatically generated test cases with high coverage, and carefully crafted cross-file contexts that supply the information needed to generate code accurately. Our work explores a controlled scenario in which developers specify the necessary code dependencies, challenging the model to integrate them accurately. Experiments show that while pretrained LLMs outperform instruction-tuned models in correctness, the latter excel at utilizing the provided dependencies and at demonstrating debugging capabilities. We also introduce a new instruction-tuning dataset that focuses on code dependencies and demonstrate that CodeLLMs fine-tuned on it leverage these dependencies more effectively. RepoExec aims to provide a comprehensive evaluation of code functionality and alignment with developer intent, paving the way for more reliable and applicable CodeLLMs in real-world scenarios. The dataset and source code are available at~\url{https://github.com/FSoft-AI4Code/RepoExec}.