The remarkable reasoning and code generation capabilities of large language models (LLMs) have spurred significant interest in applying LLMs to enable task automation in digital chip design. In particular, recent work has investigated early ideas of applying these models to formal verification (FV), an approach to verifying hardware implementations that can provide strong guarantees of confidence but demands significant amounts of human effort. While the value of LLM-driven automation is evident, our understanding of model performance, however, has been hindered by the lack of holistic evaluation. In response, we present FVEval, the first comprehensive benchmark and evaluation framework for characterizing LLM performance in tasks pertaining to FV. The benchmark consists of three sub-tasks that measure LLM capabilities at different levels: from the generation of SystemVerilog assertions (SVAs) given natural language descriptions to reasoning about the design RTL and suggesting assertions directly without additional human input. As test instances, we present both collections of expert-written verification collateral and methodologies to scalably generate synthetic examples aligned with industrial FV workflows. A wide range of existing LLMs, both proprietary and open-source, are evaluated against FVEval, based on which we investigate where today's LLMs stand and how we might further enable their application toward improving productivity in digital FV. Our benchmark and evaluation code is available at \url{https://github.com/NVlabs/FVEval}.
翻译:大型语言模型(LLM)卓越的推理与代码生成能力,激发了将其应用于数字芯片设计任务自动化的广泛兴趣。特别是,近期研究探索了将这些模型应用于形式验证(FV)的初步构想——形式验证是一种能够提供强置信度保证的硬件实现验证方法,但通常需要耗费大量人力。尽管LLM驱动的自动化价值显而易见,然而,由于缺乏整体性评估,我们对模型性能的理解一直受到制约。为此,我们提出了FVEval,这是首个用于刻画LLM在形式验证相关任务中性能的综合基准测试与评估框架。该基准包含三个子任务,从不同层面衡量LLM的能力:从根据自然语言描述生成SystemVerilog断言(SVA),到对设计RTL进行推理并直接提出断言而无需额外人工输入。作为测试实例,我们既提供了专家编写的验证资产集合,也提出了可扩展生成符合工业形式验证流程的合成样例的方法。我们对现有的大量LLM(包括专有模型和开源模型)进行了FVEval评估,并基于此探究了当前LLM的发展水平,以及如何进一步推动其应用于提升数字形式验证的生产力。我们的基准测试与评估代码已发布于 \url{https://github.com/NVlabs/FVEval}。