Utilizing large language models to generate code has shown promise in revolutionizing software development. Despite the general intelligence of large language models, their effectiveness in code generation can still be improved, owing to the syntactic gap and vocabulary mismatch between natural language and the various programming languages. In this paper, we propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework, to enhance the performance of LLMs. CodeGRAG builds a graphical view of code blocks from their control flow and data flow to close the gap between programming languages and natural language; this view helps natural-language-based LLMs better understand code syntax and serves as a bridge across programming languages. To inject the extracted structural knowledge into foundation models, we propose 1) a hard meta-graph prompt template that transforms the challenging graphical representation into informative knowledge for tuning-free models, and 2) a soft prompting technique that injects programming-language domain knowledge into the model parameters by finetuning the models with the help of a pretrained GNN expert model. Experiments and ablations on four datasets covering both C++ and Python validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the pretraining objectives for the GNN expert. CodeGRAG improves the code generation ability of LLMs and can even offer performance gains for cross-lingual code generation. Code is available at https://anonymous.4open.science/r/Code-5970/.
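To make the idea of a graphical view concrete, the sketch below builds a toy statement-level graph for a Python function: sequential control-flow edges between statements, plus data-flow edges from the statement that assigns a variable to later statements that read it. This is an illustrative simplification using the standard `ast` module, not CodeGRAG's actual graph construction; the function name `build_code_graph` and the edge representation are assumptions for this example.

```python
import ast

def build_code_graph(source: str):
    """Toy graph over a function body: (i, j, kind) edges where
    'control' links consecutive statements and 'data' links a
    variable's defining statement to statements that read it."""
    func = ast.parse(source).body[0]  # assume source holds one function
    edges = []
    last_def = {}  # variable name -> index of statement that last assigned it
    for i, stmt in enumerate(func.body):
        if i > 0:
            edges.append((i - 1, i, "control"))  # sequential control flow
        for node in ast.walk(stmt):  # data flow: reads of earlier writes
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
                if node.id in last_def:
                    edges.append((last_def[node.id], i, "data"))
        for node in ast.walk(stmt):  # record writes made by this statement
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
                last_def[node.id] = i
    return edges

src = """
def f(x):
    a = x + 1
    b = a * 2
    return a + b
"""
print(build_code_graph(src))
```

A real pipeline would operate on richer structures (branches, loops, expression-level nodes), but even this minimal graph exposes syntax-level relations that are invisible in the flat token sequence an LLM normally sees.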