Data is all you need: Finetuning LLMs for Chip Design via an Automated design-data augmentation framework

Kaiyan Chang, Kun Wang, Nan Yang, Ying Wang, Dantong Jin, Wenlong Zhu, Zhirong Chen, Cangyuan Li, Hao Yan, Yunhao Zhou, Zhuoliang Zhao, Yuan Cheng, Yudong Pan, Yiqi Liu, Mengdi Wang, Shengwen Liang, Yinhe Han, Huawei Li, Xiaowei Li

From arXiv; DAC 2024

Recent advances in large language models (LLMs) have demonstrated their potential for automatically generating hardware description language (HDL) code from high-level prompts. Researchers have used fine-tuning to enhance the ability of LLMs in chip design. However, the scarcity of Verilog data hinders further improvement in the quality of LLM-generated Verilog. In addition, the absence of a data augmentation framework for Verilog and electronic design automation (EDA) scripts significantly increases the time required to prepare training datasets. This paper proposes an automated design-data augmentation framework that generates high-volume, high-quality natural language aligned with Verilog and EDA scripts. For Verilog generation, it translates Verilog files into an abstract syntax tree and then maps the nodes to natural language using a predefined template. For Verilog repair, it uses predefined rules to generate incorrect Verilog files and then pairs EDA tool feedback with the correct and incorrect files. For EDA script generation, it uses an existing LLM (GPT-3.5) to obtain a description of each script. To evaluate the effectiveness of the data augmentation method, we fine-tune Llama2-13B and Llama2-7B models on the dataset generated by the framework. The results demonstrate a significant improvement in Verilog generation. Moreover, the accuracy of Verilog generation surpasses that of the current state-of-the-art open-source Verilog generation model, increasing from 58.8% to 70.6% on the same benchmark. Our 13B model (ChipGPT-FT) improves on the pass rate of GPT-3.5 in Verilog generation and outperforms it in EDA script (i.e., SiliconCompiler) generation with only 200 EDA script samples.
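To make the Verilog repair step more concrete, the following is a minimal Python sketch of rule-based error injection, assuming two simple mutation rules. The rule names, the `make_repair_pairs` helper, and the `stub_feedback` function are hypothetical illustrations; the abstract does not list the paper's actual rules, and a real pipeline would obtain the feedback from an EDA tool rather than a stub.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical mutation rules. Each returns a broken copy of the source,
# or None when the rule does not apply to this file.
def drop_semicolon(src: str) -> Optional[str]:
    """Delete the first semicolon to create a syntax error."""
    idx = src.find(";")
    return None if idx < 0 else src[:idx] + src[idx + 1:]

def misspell_endmodule(src: str) -> Optional[str]:
    """Misspell the first 'endmodule' keyword."""
    if "endmodule" not in src:
        return None
    return src.replace("endmodule", "endmodul", 1)

RULES: Dict[str, Callable[[str], Optional[str]]] = {
    "drop_semicolon": drop_semicolon,
    "misspell_endmodule": misspell_endmodule,
}

def stub_feedback(wrong_src: str) -> str:
    """Placeholder for EDA tool output (assumption: a real flow would
    run a compiler or linter on wrong_src and capture its messages)."""
    return "placeholder: tool feedback would go here"

def make_repair_pairs(
    correct_src: str,
    get_feedback: Callable[[str], str] = stub_feedback,
) -> List[Dict[str, str]]:
    """Apply every rule to a correct file and build one training
    record per rule: (wrong file, tool feedback) -> right file."""
    pairs = []
    for name, rule in RULES.items():
        wrong = rule(correct_src)
        if wrong is None or wrong == correct_src:
            continue  # rule did not change anything; skip it
        pairs.append({
            "rule": name,
            "wrong": wrong,
            "feedback": get_feedback(wrong),
            "right": correct_src,
        })
    return pairs

example = "module inv(input a, output y);\n  assign y = ~a;\nendmodule\n"
records = make_repair_pairs(example)
```

Each record pairs an incorrect file and its feedback with the original correct file, which is the shape of training example the repair task described above calls for. Passing a different `get_feedback` callable is where an actual tool invocation would be plugged in.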

