Recently, there has been growing interest in leveraging Large Language Models for Verilog code generation. However, the quality of the generated Verilog code remains suboptimal, largely due to the absence of well-defined, well-organized datasets with high-quality samples, as well as a lack of innovative fine-tuning methods and models specifically trained on Verilog. In this paper, we introduce a novel open-source dataset and a corresponding fine-tuning technique that utilizes a multi-layered structure, which we refer to as PyraNet. Our experiments demonstrate that employing the proposed dataset and fine-tuning approach yields a more accurate fine-tuned model that produces syntactically and functionally correct Verilog code. The evaluation results show improvements of up to $32.6\%$ over the CodeLlama-7B baseline model and up to $16.7\%$ over state-of-the-art models on the VerilogEval evaluation platform.