Recent advances in large language models have demonstrated their potential for the automated generation of hardware description language (HDL) code from high-level prompts. Researchers have used fine-tuning to enhance the ability of these large language models (LLMs) in the field of chip design. However, the scarcity of Verilog data hinders further improvement in the quality of Verilog generation by LLMs. In addition, the absence of a data augmentation framework for Verilog and Electronic Design Automation (EDA) scripts significantly increases the time required to prepare LLM training datasets. This paper proposes an automated design-data augmentation framework that generates high-volume, high-quality natural language aligned with Verilog and EDA scripts. For Verilog generation, it translates Verilog files into an abstract syntax tree and then maps nodes to natural language with predefined templates. For Verilog repair, it applies predefined rules to generate incorrect Verilog files and then pairs EDA tool feedback with the correct and incorrect Verilog files. For EDA script generation, it uses an existing LLM (GPT-3.5) to obtain descriptions of the scripts. To evaluate the effectiveness of our data augmentation method, we fine-tune Llama2-13B and Llama2-7B models on the dataset generated by our augmentation framework. The results demonstrate a significant improvement in Verilog generation tasks with LLMs. Moreover, the accuracy of Verilog generation surpasses that of the current state-of-the-art open-source Verilog generation model, increasing from 58.8% to 70.6% on the same benchmark. Our 13B model (ChipGPT-FT) achieves a higher pass rate than GPT-3.5 in Verilog generation and outperforms it in EDA script (i.e., SiliconCompiler) generation with only 200 EDA script samples.
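The template-based mapping from Verilog structure to natural language can be sketched as follows. This is a minimal illustration, not the paper's implementation: it extracts a module header with a simplified regex parse (a real pipeline would walk a full abstract syntax tree) and renders the ports through a hypothetical description template.

```python
import re

# Simplified stand-in for an AST: extract the module header with regexes.
MODULE_RE = re.compile(r"module\s+(\w+)\s*\(([^)]*)\)\s*;", re.S)
PORT_RE = re.compile(r"(input|output)\s*(?:\[(\d+):(\d+)\]\s*)?(\w+)")

def describe_module(src: str) -> str:
    """Map a Verilog module header to a natural-language description
    using a fixed template (one illustrative augmentation rule)."""
    m = MODULE_RE.search(src)
    if m is None:
        raise ValueError("no module declaration found")
    name, port_text = m.group(1), m.group(2)
    parts = []
    for direction, hi, lo, port in PORT_RE.findall(port_text):
        width = int(hi) - int(lo) + 1 if hi else 1
        parts.append(f"a {width}-bit {direction} '{port}'")
    return f"Module '{name}' has " + ", ".join(parts) + "."

src = """
module adder(input [3:0] a, input [3:0] b, output [4:0] sum);
  assign sum = a + b;
endmodule
"""
print(describe_module(src))
# Module 'adder' has a 4-bit input 'a', a 4-bit input 'b', a 5-bit output 'sum'.
```

Pairing each generated description with its source Verilog yields one aligned (instruction, code) training sample; richer templates covering assignments, always blocks, and parameters would expand coverage per file.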
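The rule-based generation of incorrect Verilog for the repair task can be sketched in the same spirit. The mutation rules below are hypothetical examples (clock-edge inversion, logical-operator swap, non-blocking to blocking assignment), not the paper's actual rule set, and the EDA tool feedback that the framework attaches to each pair is omitted here.

```python
# Hypothetical mutation rules: each rewrites a correct snippet into a
# plausibly buggy variant, producing (incorrect, correct) repair pairs.
MUTATION_RULES = [
    ("posedge", "negedge"),  # clock-edge inversion
    ("&&", "||"),            # logical operator swap
    ("<=", "="),             # non-blocking -> blocking assignment
]

def make_repair_pairs(correct_src: str) -> list[dict]:
    """Apply each applicable rule once to yield repair training pairs."""
    pairs = []
    for old, new in MUTATION_RULES:
        if old in correct_src:
            pairs.append({
                "incorrect": correct_src.replace(old, new, 1),
                "correct": correct_src,
                "rule": f"{old} -> {new}",
            })
    return pairs

correct = """
always @(posedge clk) begin
  if (en && rst_n)
    q <= d;
end
"""
for pair in make_repair_pairs(correct):
    print(pair["rule"])
```

In the full framework, each incorrect file would additionally be run through an EDA tool, and the resulting error message would be included in the training sample so the model learns to repair code from tool feedback.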