QiMeng-CodeV-R1: Reasoning-Enhanced Verilog Generation

Yaoyu Zhu, Di Huang, Hanqi Lyu, Xiaoyun Zhang, Chongxiao Li, Wenxuan Shi, Yutong Wu, Jianan Mu, Jinghua Wang, Yang Zhao, Pengwei Jin, Shuyao Cheng, Shengwen Liang, Xishan Zhang, Rui Zhang, Zidong Du, Qi Guo, Xing Hu, Yunji Chen

Large language models (LLMs) trained via reinforcement learning with verifiable reward (RLVR) have achieved breakthroughs on tasks with explicit, automatable verification, such as software programming and mathematical problems. Extending RLVR to electronic design automation (EDA), especially to automatically generating hardware description language (HDL) code such as Verilog from natural-language (NL) specifications, poses three key challenges: the lack of automated and accurate verification environments, the scarcity of high-quality NL-code pairs, and the prohibitive computation cost of RLVR. To address these challenges, we introduce CodeV-R1, an RLVR framework for training Verilog generation LLMs. First, we develop a rule-based testbench generator that performs robust equivalence checking against golden references. Second, we propose a round-trip data synthesis method that pairs open-source Verilog snippets with LLM-generated NL descriptions, verifies code-NL-code consistency via the generated testbench, and filters out non-equivalent examples to yield a high-quality dataset. Third, we employ a two-stage "distill-then-RL" training pipeline: distillation for the cold start of reasoning abilities, followed by adaptive DAPO, our novel RLVR algorithm that reduces training cost by adaptively adjusting the sampling rate. The resulting model, CodeV-R1-7B, achieves 68.6% and 72.9% pass@1 on VerilogEval v2 and RTLLM v1.1, respectively, surpassing the prior state of the art by 12 to 20% and even exceeding the performance of the 671B DeepSeek-R1 on RTLLM. We have released our model, training code, and dataset to facilitate research in the EDA and LLM communities.
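
The round-trip filtering idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each design is modeled as a plain Python function instead of simulated Verilog, and it substitutes random stimulus for the paper's rule-based testbench generator. All names (`Design`, `random_vectors`, `equivalent`, `round_trip_filter`) are hypothetical.

```python
import random
from typing import Callable, List, Tuple

# Assumption: a "design" is a pure function from an input vector to an output
# value. A real flow would simulate Verilog modules instead.
Design = Callable[[Tuple[int, ...]], int]


def random_vectors(num_inputs: int, width: int, count: int, seed: int = 0) -> List[Tuple[int, ...]]:
    """Generate `count` random input vectors, each with `num_inputs` values of `width` bits."""
    rng = random.Random(seed)
    return [
        tuple(rng.randrange(1 << width) for _ in range(num_inputs))
        for _ in range(count)
    ]


def equivalent(golden: Design, candidate: Design, vectors: List[Tuple[int, ...]]) -> bool:
    """Return True if both designs produce the same output on every test vector."""
    return all(golden(v) == candidate(v) for v in vectors)


def round_trip_filter(
    samples: List[Tuple[str, Design, Design]],
    vectors: List[Tuple[int, ...]],
) -> List[Tuple[str, Design]]:
    """Keep (description, golden) pairs whose regenerated design matches the golden one.

    Each sample is (NL description, golden design, design regenerated from the description).
    """
    return [
        (description, golden)
        for description, golden, regenerated in samples
        if equivalent(golden, regenerated, vectors)
    ]


# Usage: an 8-bit adder as the golden design, with one faithful and one faulty regeneration.
adder = lambda v: (v[0] + v[1]) & 0xFF
good_regen = lambda v: (v[1] + v[0]) % 256
bad_regen = lambda v: v[0] ^ v[1]  # differs from addition whenever a carry occurs

# Random vectors plus one directed vector (1, 1) that is known to produce a carry.
vectors = random_vectors(num_inputs=2, width=8, count=64) + [(1, 1)]
samples = [
    ("8-bit adder", adder, good_regen),
    ("8-bit adder (bad description)", adder, bad_regen),
]
kept = round_trip_filter(samples, vectors)
```

Matching outputs on a finite stimulus set can only reject non-equivalent pairs; it does not prove equivalence, which is presumably why the abstract emphasizes a robust, rule-based testbench.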
