Automatic generation of RTL code for digital hardware designs remains challenging because Verilog and VHDL involve long-horizon reasoning, multi-step dependencies, and strict correctness constraints. We present StepPRM-RTL, a novel framework that combines stepwise trajectory modeling, process-reward modeling (PRM), and retrieval-augmented fine-tuning (RAFT) to improve both the functional correctness and the reasoning fidelity of LLM-based RTL code generation. StepPRM-RTL constructs stepwise reasoning trajectories from canonical solutions, where each step contains a rationale and an incremental code modification. A process reward model evaluates the intermediate steps, providing dense feedback that guides reinforcement-style updates during RAFT. Monte Carlo Tree Search (MCTS) explores alternative reasoning paths, enriching the training dataset with high-quality trajectories. Combining stepwise and outcome-aware rewards lets the model learn both how and why to construct correct RTL, improving long-horizon reasoning beyond standard supervised or outcome-based training. Experimental evaluation on benchmark Verilog and VHDL datasets shows that StepPRM-RTL outperforms the best prior methods by over 10% on functional correctness and reasoning fidelity metrics. Ablation studies confirm that the combination of PRM-guided rewards and stepwise trajectory exploration is key to its performance. StepPRM-RTL generalizes across RTL languages and provides a scalable framework for high-fidelity, interpretable code generation, establishing a new standard for LLM-assisted hardware design automation.
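To make the idea of a stepwise trajectory scored by both process and outcome rewards concrete, the following is a minimal sketch and not the authors' implementation. The names `Step`, `Trajectory`, `score_trajectory`, the mixing weight `alpha`, and the two toy reward functions are all hypothetical: the abstract does not specify how step and outcome rewards are combined, and in practice the step reward would come from a learned process reward model and the outcome reward from simulation or testbench results.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    """One trajectory step: a rationale plus the code it adds."""
    rationale: str
    code_delta: str


@dataclass
class Trajectory:
    """A specification and the ordered steps that build up the RTL."""
    spec: str
    steps: List[Step] = field(default_factory=list)

    def code(self) -> str:
        # The design so far is the concatenation of all incremental edits.
        return "\n".join(s.code_delta for s in self.steps)


def score_trajectory(
    traj: Trajectory,
    step_reward: Callable[[str, Step], float],
    outcome_reward: Callable[[str], float],
    alpha: float = 0.5,
) -> float:
    """Blend the mean per-step reward with a final outcome reward.

    alpha is an assumed mixing weight, chosen only for illustration.
    """
    if not traj.steps:
        return 0.0
    prefix = ""
    step_scores = []
    for step in traj.steps:
        # Each step is judged given the code written before it.
        step_scores.append(step_reward(prefix, step))
        prefix += step.code_delta + "\n"
    process = sum(step_scores) / len(step_scores)
    outcome = outcome_reward(traj.code())
    return alpha * process + (1 - alpha) * outcome


def toy_step_reward(prefix: str, step: Step) -> float:
    # Placeholder for a learned PRM: reward steps that pair a
    # rationale with a non-empty code change.
    return 1.0 if step.rationale.strip() and step.code_delta.strip() else 0.0


def toy_outcome_reward(code: str) -> float:
    # Placeholder for functional verification: only checks that the
    # text opens with `module` and closes with `endmodule`.
    text = code.strip()
    return 1.0 if text.startswith("module") and text.endswith("endmodule") else 0.0


if __name__ == "__main__":
    traj = Trajectory(
        spec="2-bit counter",
        steps=[
            Step("Declare the ports.",
                 "module counter(input clk, output reg [1:0] q);"),
            Step("Increment on each rising edge and close the module.",
                 "always @(posedge clk) q <= q + 1;\nendmodule"),
        ],
    )
    print(score_trajectory(traj, toy_step_reward, toy_outcome_reward))
```

A search procedure such as MCTS could use a score of this kind to rank candidate trajectories before they are added to the training set; how StepPRM-RTL actually does this is not described in the abstract.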