The application of large language models (LLMs) to digital hardware code generation is an emerging field. Most LLMs are primarily trained on natural language and software code; hardware code such as Verilog represents only a small portion of the training data, and few hardware benchmarks exist. To address this gap, the open-source VerilogEval benchmark was released in 2023, providing a consistent evaluation framework for LLMs on code completion tasks. It was tested on state-of-the-art models at the time, including GPT-4. However, VerilogEval and other Verilog generation benchmarks lack failure analysis and, in their present form, are not conducive to exploring prompting techniques. Also, since VerilogEval's release, both commercial and open-source models have seen continued development. In this work, we evaluate new commercial and open-source models of varying sizes against an improved VerilogEval benchmark suite. We enhance VerilogEval's infrastructure and dataset by automatically classifying failures, introduce new prompts that support in-context learning (ICL) examples, and extend the supported tasks to specification-to-RTL translation. We find a measurable improvement in commercial state-of-the-art models, with GPT-4 Turbo achieving a 59% pass rate on spec-to-RTL tasks. We also study the performance of open-source and domain-specific models that have emerged, and demonstrate that models can benefit substantially from ICL. We find that the recently released Llama 3.1 405B achieves a pass rate of 58%, effectively matching that of GPT-4 Turbo, and that the much smaller domain-specific RTL-Coder 6.7B models achieve an impressive 37% pass rate. However, prompt engineering remains key to achieving good pass rates, and the best-performing prompts vary widely with model and task. A benchmark infrastructure that allows for prompt engineering and failure analysis is therefore key to continued model development and deployment.
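For context on the quoted pass rates: VerilogEval scores completions with the standard unbiased pass@k estimator of Chen et al. (2021), where n samples are drawn per problem and c of them pass the reference testbench; the headline numbers above are presumably pass@1 under this estimator:

\[
\text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right]
\]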
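As an illustration of the ICL prompting described above, the sketch below assembles an n-shot spec-to-RTL prompt by prepending worked specification/RTL pairs to the target specification. The example pair and the build_icl_prompt helper are hypothetical placeholders for exposition, not VerilogEval's actual prompt templates or API.

# Minimal sketch of an n-shot in-context-learning (ICL) prompt for the
# spec-to-RTL task. The example problem and helper names are hypothetical.

ICL_EXAMPLE = {
    "spec": "Implement a 2-to-1 multiplexer with inputs a, b, select sel, "
            "and output out.",
    "rtl": "module top_module(input a, input b, input sel, output out);\n"
           "  assign out = sel ? b : a;\n"
           "endmodule",
}

def build_icl_prompt(spec: str, examples: list[dict]) -> str:
    """Prepend worked spec->RTL pairs before the target specification."""
    parts = []
    for ex in examples:
        parts.append(f"Specification:\n{ex['spec']}\n\nAnswer:\n{ex['rtl']}\n")
    parts.append(f"Specification:\n{spec}\n\nAnswer:\n")
    return "\n".join(parts)

if __name__ == "__main__":
    # One-shot prompt for a new problem; more examples yield n-shot prompts.
    print(build_icl_prompt("Implement a D flip-flop with synchronous reset.",
                           [ICL_EXAMPLE]))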