Code generation tools are essential aids for developers throughout the software development process. Existing tools are often disconnected from their working context, i.e., the code repository, so the code they generate differs from what human developers would write. In this paper, we propose a novel code generation framework, dubbed \textbf{$A^3$}-CodGen, that harnesses information within the code repository to generate code with fewer logical errors, less code redundancy, and fewer library-related compatibility issues. We identify three categories of representative information in the code repository: local-aware information from the current code file, global-aware information from other code files, and third-party-library information. Results show that the \textbf{$A^3$}-CodGen framework successfully extracts, fuses, and feeds code repository information into the large language model (LLM), which then generates more accurate, efficient, and highly reusable code. The effectiveness of our framework is further underscored by the fact that its generated code has a higher reuse rate than code written by human developers. This work contributes to the field of code generation by providing developers with a more powerful tool for addressing the evolving demands of software development in practice.