We present four main contributions that improve the performance of Large Language Models (LLMs) in generating domain-specific code: (i) LLM-based data splitting and data renovation techniques that improve the semantic representation of the embedding space; (ii) the LLM-driven Chain of Density for Renovation Credibility (CoDRC) and the Adaptive Text Renovation (ATR) algorithm, which together assess the reliability of data renovation; (iii) the Implicit Knowledge Expansion and Contemplation (IKEC) prompt technique; and (iv) the use of LLMs to refactor existing scripts into new, high-quality scripts. Using the engineering simulation software RedHawk-SC as a case study, we demonstrate the effectiveness of our data pre-processing method for expanding and categorizing scripts. When combined with IKEC, these techniques enable the Retrieval-Augmented Generation (RAG) method to retrieve more relevant information, ultimately achieving a 73.33% "Percentage of Correct Lines" on code generation problems in MapReduce applications.