Utilizing large language models to generate code has shown promise in revolutionizing software development. Despite the general intelligence of large language models, their effectiveness in code generation can still be improved, owing to the syntactic gap and vocabulary mismatch between natural language and the various programming languages. In this paper, we propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework, to enhance the performance of LLMs. CodeGRAG builds a graphical view of code blocks from their control flow and data flow to close the gap between programming languages and natural language; this view helps natural-language-based LLMs better understand code syntax and serves as a bridge across programming languages. To inject the extracted structural knowledge into foundation models, we propose 1) a hard meta-graph prompt template that transforms the challenging graphical representation into informative knowledge for tuning-free models, and 2) a soft prompting technique that injects programming-language domain knowledge into the model parameters by finetuning the models with the help of a pretrained GNN expert model. Experiments and ablations on four datasets covering both C++ and Python validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the pretraining objectives for the GNN expert. CodeGRAG improves the code generation ability of LLMs and can even offer performance gains for cross-lingual code generation. Code is available at https://anonymous.4open.science/r/Code-5970/.
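To make the idea of a graphical view concrete, the sketch below builds a toy statement-level graph for a Python function: sequential control-flow edges between statements, plus data-flow edges from the statement that assigns a variable to later statements that read it. This is an illustrative simplification using the standard `ast` module, not CodeGRAG's actual graph construction; the function name `build_code_graph` and the edge representation are assumptions for this example.

```python
import ast

def build_code_graph(source: str):
    """Toy graph over a function body: (i, j, kind) edges where
    'control' links consecutive statements and 'data' links a
    variable's defining statement to statements that read it."""
    func = ast.parse(source).body[0]  # assume source holds one function
    edges = []
    last_def = {}  # variable name -> index of statement that last assigned it
    for i, stmt in enumerate(func.body):
        if i > 0:
            edges.append((i - 1, i, "control"))  # sequential control flow
        for node in ast.walk(stmt):  # data flow: reads of earlier writes
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
                if node.id in last_def:
                    edges.append((last_def[node.id], i, "data"))
        for node in ast.walk(stmt):  # record writes made by this statement
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
                last_def[node.id] = i
    return edges

src = """
def f(x):
    a = x + 1
    b = a * 2
    return a + b
"""
print(build_code_graph(src))
```

A real pipeline would operate on richer structures (branches, loops, expression-level nodes), but even this minimal graph exposes syntax-level relations that are invisible in the flat token sequence an LLM normally sees.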