AssertionBench: A Benchmark to Evaluate Large-Language Models for Assertion Generation

Assertions have been the de facto collateral for simulation-based and formal verification of hardware designs for over a decade. The quality of hardware verification, \ie, detection and diagnosis of corner-case design bugs, is critically dependent on the quality of the assertions. There has been a considerable amount of research leveraging a blend of data-driven statistical analysis and static analysis to generate high-quality assertions from hardware design source code and design execution trace data. Despite such concerted effort, all prior research struggles to scale to industrial-scale large designs, generates too many low-quality assertions, often fails to capture subtle and non-trivial design functionality, and does not produce any easy-to-comprehend explanations of the generated assertions to understand assertions' suitability to different downstream validation tasks. Recently, with the advent of Large-Language Models (LLMs), there has been a widespread effort to leverage prompt engineering to generate assertions. However, there is little effort to quantitatively establish the effectiveness and suitability of various LLMs for assertion generation. In this paper, we present AssertionBench, a novel benchmark to evaluate LLMs' effectiveness for assertion generation quantitatively. AssertioBench contains 100 curated Verilog hardware designs from OpenCores and formally verified assertions for each design generated from GoldMine and HARM. We use AssertionBench to compare state-of-the-art LLMs to assess their effectiveness in inferring functionally correct assertions for hardware designs. Our experiments demonstrate how LLMs perform relative to each other, the benefits of using more in-context exemplars in generating a higher fraction of functionally correct assertions, and the significant room for improvement for LLM-based assertion generators.

翻译：断言作为硬件设计仿真验证与形式验证的事实标准辅助验证手段已沿用十余年。硬件验证的质量（即对极端情况设计缺陷的检测与诊断）关键取决于断言的质量。已有大量研究通过融合数据驱动的统计分析技术与静态分析技术，从硬件设计源代码及设计执行轨迹数据中生成高质量断言。尽管付出了这些协同努力，所有现有研究仍难以扩展至工业级大规模设计，往往产生过多低质量断言，经常无法捕捉微妙且复杂的设计功能特性，且未能为生成的断言提供易于理解的解释来说明其适用于不同下游验证任务的适配性。近年来，随着大语言模型（LLMs）的出现，学界开始广泛探索如何通过提示工程生成断言。然而，目前鲜有研究能定量评估各类LLMs在断言生成任务中的有效性与适用性。本文提出AssertionBench——一个用于定量评估LLMs断言生成效能的新型基准。AssertionBench包含从OpenCores精选的100个Verilog硬件设计，以及通过GoldMine与HARM工具为每个设计生成的形式化验证断言。我们运用AssertionBench对前沿LLMs进行比较，评估其在推断硬件设计功能正确断言方面的有效性。实验结果表明：不同LLMs之间的相对性能差异，使用更多上下文示例能提升生成功能正确断言的比例，以及基于LLM的断言生成器仍存在显著的改进空间。