FVEval: Understanding Language Model Capabilities in Formal Verification of Digital Hardware

The remarkable reasoning and code generation capabilities of large language models (LLMs) have spurred significant interest in applying LLMs to enable task automation in digital chip design. In particular, recent work has investigated early ideas of applying these models to formal verification (FV), an approach to verifying hardware implementations that can provide strong guarantees of confidence but demands significant amounts of human effort. While the value of LLM-driven automation is evident, our understanding of model performance, however, has been hindered by the lack of holistic evaluation. In response, we present FVEval, the first comprehensive benchmark and evaluation framework for characterizing LLM performance in tasks pertaining to FV. The benchmark consists of three sub-tasks that measure LLM capabilities at different levels: from the generation of SystemVerilog assertions (SVAs) given natural language descriptions to reasoning about the design RTL and suggesting assertions directly without additional human input. As test instances, we present both collections of expert-written verification collateral and methodologies to scalably generate synthetic examples aligned with industrial FV workflows. A wide range of existing LLMs, both proprietary and open-source, are evaluated against FVEval, based on which we investigate where today's LLMs stand and how we might further enable their application toward improving productivity in digital FV. Our benchmark and evaluation code is available at \url{https://github.com/NVlabs/FVEval}.

翻译：大型语言模型（LLM）卓越的推理与代码生成能力，激发了将其应用于数字芯片设计任务自动化的广泛兴趣。特别是，近期研究探索了将这些模型应用于形式验证（FV）的初步构想——形式验证是一种能够提供强置信度保证的硬件实现验证方法，但通常需要耗费大量人力。尽管LLM驱动的自动化价值显而易见，然而，由于缺乏整体性评估，我们对模型性能的理解一直受到制约。为此，我们提出了FVEval，这是首个用于刻画LLM在形式验证相关任务中性能的综合基准测试与评估框架。该基准包含三个子任务，从不同层面衡量LLM的能力：从根据自然语言描述生成SystemVerilog断言（SVA），到对设计RTL进行推理并直接提出断言而无需额外人工输入。作为测试实例，我们既提供了专家编写的验证资产集合，也提出了可扩展生成符合工业形式验证流程的合成样例的方法。我们对现有的大量LLM（包括专有模型和开源模型）进行了FVEval评估，并基于此探究了当前LLM的发展水平，以及如何进一步推动其应用于提升数字形式验证的生产力。我们的基准测试与评估代码已发布于 \url{https://github.com/NVlabs/FVEval}。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【CVPR 2022】一个完全无监督的框架，从噪声和部分测量中学习图像，Robust Equivariant Imaging: a fully unsupervised framework for learning to image

专知会员服务

25+阅读 · 2022年3月3日

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日