This paper aims to extend the code generation capability of large language models (LLMs) to automatically manage comprehensive software requirements from given textual descriptions. Such requirements include both functional requirements (i.e., achieving the expected behavior for inputs) and non-functional requirements (e.g., time/space performance, robustness, maintainability). However, textual descriptions may express requirements verbosely or even omit some of them. We introduce ARCHCODE, a novel framework that leverages in-context learning to organize the requirements observed in a description and to extrapolate the requirements it leaves unexpressed. ARCHCODE generates requirements from a given description, then conditions on them to produce code snippets and test cases. Each test case is tailored to one of the requirements, allowing code snippets to be ranked by how well their execution results comply with the requirements. On public benchmarks, ARCHCODE improves the satisfaction of functional requirements, significantly boosting Pass@k scores. Furthermore, we introduce HumanEval-NFR, the first benchmark for evaluating LLMs' fulfillment of non-functional requirements in code generation, and demonstrate ARCHCODE's superiority over baseline methods. The implementation of ARCHCODE and the HumanEval-NFR benchmark are both publicly accessible.
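The ranking step described above can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: the names (`Candidate`, `run_test`, `rank_candidates`) and the scoring rule (count of passing requirement-tailored tests) are assumptions for exposition.

```python
# Hypothetical sketch: score each candidate code snippet by how many
# requirement-tailored test cases its execution results comply with,
# then rank candidates by that score. Names are illustrative only.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Candidate:
    name: str
    func: Callable[[int], int]  # the generated code snippet, as a callable

def run_test(func: Callable[[int], int], test: Tuple[int, int]) -> bool:
    """Return True if executing the snippet on the test input matches the
    expected output; exceptions count as non-compliance (robustness)."""
    inp, expected = test
    try:
        return func(inp) == expected
    except Exception:
        return False

def rank_candidates(candidates: List[Candidate],
                    tests: List[Tuple[int, int]]) -> List[Candidate]:
    """Rank snippets by the number of requirement-derived tests they pass."""
    return sorted(
        candidates,
        key=lambda c: sum(run_test(c.func, t) for t in tests),
        reverse=True,
    )

# Toy requirement: "return the square of x", with tests covering an edge case.
tests = [(2, 4), (3, 9), (0, 0)]
candidates = [
    Candidate("buggy", lambda x: x * 2),    # violates the functional requirement
    Candidate("correct", lambda x: x * x),  # satisfies every test
]
ranked = rank_candidates(candidates, tests)
print(ranked[0].name)  # correct
```

In the actual framework, each test targets a specific functional or non-functional requirement, so the compliance score reflects how fully a snippet satisfies the organized requirement set rather than raw input/output correctness alone.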