Using large language models (LLMs) to generate code has shown promise for transforming software development. Although general-purpose LLMs are capable, their code generation performance can still be improved, because a syntactic gap and a vocabulary mismatch exist between natural language and the various programming languages. In addition, programming languages are inherently logical and complex, which makes correct code hard to generate. Existing methods rely on prompting the LLM multiple times to explore better solutions, which is expensive. In this paper, we propose Syntax Graph Retrieval Augmented Code Generation (CodeGRAG) to enhance the performance of LLMs in single-round code generation tasks. CodeGRAG extracts and summarizes the control flow and data flow of code blocks to bridge the gap between programming languages and natural language. The extracted external structural knowledge models the inherent flows of code blocks, which helps LLMs better understand code syntax and serves as a bridge between different programming languages. CodeGRAG significantly improves the code generation ability of LLMs and can even offer performance gains for cross-lingual code generation, e.g., C++ for Python.
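As a rough illustration of what extracting control flow and data flow from a code block can look like, the following Python sketch builds two small edge lists with the standard `ast` module. It is not the CodeGRAG implementation: the function name `extract_flow_edges`, the edge format, and the sample snippet are assumptions made for illustration, and both analyses are deliberate approximations (nesting stands in for control flow, and a single linear pass stands in for data flow, so loop back-edges and branch merges are ignored).

```python
import ast

# Hypothetical sample code block; not taken from the paper.
SAMPLE = """\
def total_positive(xs):
    total = 0
    for x in xs:
        if x > 0:
            total = total + x
    return total
"""


def extract_flow_edges(source: str):
    """Return (control_edges, data_edges) for a snippet of Python code.

    control_edges: (parent_line, child_line) pairs linking each if/for/while
        statement to the statements nested directly inside it.
    data_edges: (def_line, use_line, name) triples linking the most recent
        assignment of a name (in source order) to a later read of it.
    """
    tree = ast.parse(source)

    # Control flow (approximation): branching/looping statement -> nested statement.
    control_edges = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.If, ast.For, ast.While)):
            for child in node.body + node.orelse:
                control_edges.append((node.lineno, child.lineno))

    # Function parameters count as definitions on the line where they appear.
    last_def = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            last_def[node.arg] = node.lineno

    # Data flow (approximation): walk names in source order; on a given line,
    # reads are processed before writes, as in `total = total + x`.
    names = [n for n in ast.walk(tree) if isinstance(n, ast.Name)]
    names.sort(key=lambda n: (n.lineno, isinstance(n.ctx, ast.Store), n.col_offset))

    data_edges = []
    for name in names:
        if isinstance(name.ctx, ast.Load) and name.id in last_def:
            data_edges.append((last_def[name.id], name.lineno, name.id))
        elif isinstance(name.ctx, ast.Store):
            last_def[name.id] = name.lineno
    return control_edges, data_edges


if __name__ == "__main__":
    control, data = extract_flow_edges(SAMPLE)
    print("control edges:", control)
    print("data edges:", data)
```

Edge lists of this kind are one simple form in which structural information about a code block could be summarized and supplied to an LLM alongside a natural-language task description.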