The application of large language models (LLMs) to digital hardware code generation is an emerging field. Most LLMs are primarily trained on natural language and software code; hardware code such as Verilog represents only a small portion of the training data, and few hardware benchmarks exist. To address this gap, the open-source VerilogEval benchmark was released in 2023, providing a consistent evaluation framework for LLMs on code completion tasks. It was tested on state-of-the-art models at the time, including GPT-4. However, VerilogEval and other Verilog generation benchmarks lack failure analysis and, in their present form, are not conducive to exploring prompting techniques. Also, since VerilogEval's release, both commercial and open-source models have seen continued development. In this work, we evaluate new commercial and open-source models of varying sizes against an improved VerilogEval benchmark suite. We enhance VerilogEval's infrastructure and dataset by automatically classifying failures, introduce new prompts that support in-context learning (ICL) examples, and extend the supported tasks to specification-to-RTL translation. We find a measurable improvement in commercial state-of-the-art models, with GPT-4 Turbo achieving a 59% pass rate on spec-to-RTL tasks. We also study the performance of open-source and domain-specific models that have emerged, and demonstrate that models can benefit substantially from ICL. We find that the recently released Llama 3.1 405B achieves a pass rate of 58%, effectively matching that of GPT-4 Turbo, and that the much smaller domain-specific RTL-Coder 6.7B models achieve an impressive 37% pass rate. However, prompt engineering remains key to achieving good pass rates, and the best-performing prompts vary widely with model and task. A benchmark infrastructure that allows for prompt engineering and failure analysis is therefore key to continued model development and deployment.
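For context on the quoted pass rates: VerilogEval scores completions with the standard unbiased pass@k estimator of Chen et al. (2021), where n samples are drawn per problem and c of them pass the reference testbench; the headline numbers above are presumably pass@1 under this estimator:

\[
\text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right]
\]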
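As an illustration of the ICL prompting described above, the sketch below assembles an n-shot spec-to-RTL prompt by prepending worked specification/RTL pairs to the target specification. The example pair and the build_icl_prompt helper are hypothetical placeholders for exposition, not VerilogEval's actual prompt templates or API.

# Minimal sketch of an n-shot in-context-learning (ICL) prompt for the
# spec-to-RTL task. The example problem and helper names are hypothetical.

ICL_EXAMPLE = {
    "spec": "Implement a 2-to-1 multiplexer with inputs a, b, select sel, "
            "and output out.",
    "rtl": "module top_module(input a, input b, input sel, output out);\n"
           "  assign out = sel ? b : a;\n"
           "endmodule",
}

def build_icl_prompt(spec: str, examples: list[dict]) -> str:
    """Prepend worked spec->RTL pairs before the target specification."""
    parts = []
    for ex in examples:
        parts.append(f"Specification:\n{ex['spec']}\n\nAnswer:\n{ex['rtl']}\n")
    parts.append(f"Specification:\n{spec}\n\nAnswer:\n")
    return "\n".join(parts)

if __name__ == "__main__":
    # One-shot prompt for a new problem; more examples yield n-shot prompts.
    print(build_icl_prompt("Implement a D flip-flop with synchronous reset.",
                           [ICL_EXAMPLE]))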