SystemVerilog Assertions (SVAs) are crucial for hardware verification. Recent studies use general-purpose LLMs to translate natural-language properties into SVAs (NL2SVA), but these models perform poorly because of limited data. We propose a data synthesis framework that tackles two challenges: the scarcity of high-quality real-world SVA corpora and the lack of reliable methods for determining NL-SVA semantic equivalence. For the former, we use large-scale open-source RTL designs to guide LLMs in generating real-world SVAs; for the latter, we use bidirectional translation as a data selection method. With the synthesized data, we train CodeV-SVA, a series of SVA generation models. Notably, CodeV-SVA-14B achieves 75.8% on NL2SVA-Human and 84.0% on NL2SVA-Machine in Func.@1, matching or exceeding advanced LLMs such as GPT-5 and DeepSeek-R1.
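One plausible reading of bidirectional translation as a data selection method is a round-trip filter: describe an SVA in natural language, translate that description back into an SVA, and keep the pair only if the result matches the original. The Python sketch below illustrates that reading only. The abstract does not specify the procedure, so the function `select_by_round_trip`, its three callable parameters, and the toy stand-ins are all hypothetical; in practice the two translators would be LLM calls and the equivalence check would presumably be a formal tool rather than string comparison.

```python
from typing import Callable, Iterable, List, Tuple


def select_by_round_trip(
    svas: Iterable[str],
    sva_to_nl: Callable[[str], str],
    nl_to_sva: Callable[[str], str],
    equivalent: Callable[[str, str], bool],
) -> List[Tuple[str, str]]:
    """Keep (NL, SVA) pairs whose NL description translates back
    to an SVA judged equivalent to the original (assumed procedure)."""
    kept = []
    for sva in svas:
        nl = sva_to_nl(sva)           # SVA -> natural language
        back = nl_to_sva(nl)          # natural language -> SVA
        if equivalent(sva, back):     # discard pairs that drift in the round trip
            kept.append((nl, sva))
    return kept


# Toy stand-ins (hypothetical) so the sketch runs without any model or tool.
SVA_A = "assert property (@(posedge clk) req |-> ##1 ack);"
SVA_B = "assert property (@(posedge clk) valid |-> ready);"

_TOY_NL = {
    SVA_A: "ack follows req one cycle later",
    SVA_B: "valid implies something",  # deliberately vague description
}
_TOY_BACK = {
    "ack follows req one cycle later":
        "assert property (@(posedge clk)  req |-> ##1 ack);",
    "valid implies something":
        "assert property (@(posedge clk) valid |-> ##1 ready);",
}


def toy_sva_to_nl(sva: str) -> str:
    return _TOY_NL[sva]


def toy_nl_to_sva(nl: str) -> str:
    return _TOY_BACK[nl]


def toy_equivalent(a: str, b: str) -> bool:
    # Whitespace-insensitive string match; a placeholder for a real
    # semantic equivalence check.
    return " ".join(a.split()) == " ".join(b.split())
```

With these stand-ins, `select_by_round_trip([SVA_A, SVA_B], toy_sva_to_nl, toy_nl_to_sva, toy_equivalent)` keeps only the pair for `SVA_A`, because the vague description of `SVA_B` translates back to a different assertion.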