The ability of CodeLLMs to generate executable and functionally correct code at the \textit{repository-level scale} remains largely unexplored. We introduce \methodnamews, a novel benchmark for evaluating code generation at the repository-level scale, with an emphasis on executability and correctness. \methodnamews provides an automated system that verifies requirements and incorporates a mechanism for dynamically generating high-coverage test cases to assess the functionality of generated code. Our work explores a controlled scenario in which developers specify the necessary code dependencies, challenging the model to integrate them accurately. Experiments show that while pretrained LLMs outperform instruction-tuned models in correctness, the latter excel at utilizing the provided dependencies and demonstrate stronger debugging capabilities. \methodnamews aims to provide a comprehensive evaluation of code functionality and alignment with developer intent, paving the way for more reliable and applicable CodeLLMs in real-world scenarios.