StepPRM-RTL: Stepwise Process-Reward Guided LLM Fine-Tuning for Enhanced RTL Synthesis

Automatic generation of RTL code for digital hardware designs remains challenging because Verilog and VHDL demand long-horizon reasoning, handling of multi-step dependencies, and strict correctness. We present StepPRM-RTL, a novel framework that combines stepwise trajectory modeling, process-reward modeling (PRM), and retrieval-augmented fine-tuning (RAFT) to improve both the functional correctness and the reasoning fidelity of LLM-based RTL code generation. StepPRM-RTL constructs stepwise reasoning trajectories from canonical solutions, where each step contains a rationale and an incremental code modification. A process reward model evaluates the intermediate steps, providing dense feedback that guides reinforcement-style updates during RAFT. Monte Carlo Tree Search (MCTS) explores alternative reasoning paths, enriching the training dataset with high-quality trajectories. Integrating stepwise and outcome-aware rewards in this way lets the model learn both how and why to construct correct RTL, improving long-horizon reasoning beyond standard supervised or outcome-based training. Experimental evaluation on benchmark Verilog and VHDL datasets demonstrates that StepPRM-RTL outperforms the best prior methods by over 10% in functional correctness and reasoning fidelity metrics. Ablation studies confirm that the combination of PRM-guided rewards and stepwise trajectory exploration is key to its performance. StepPRM-RTL generalizes across RTL languages and provides a scalable framework for high-fidelity, interpretable code generation, establishing a new standard for LLM-assisted hardware design automation.
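
As a rough illustration of the idea of pairing per-step (process) rewards with an outcome reward, the following minimal sketch represents a trajectory as a list of steps, each holding a rationale and an incremental code modification, and blends a mean step score with a binary pass/fail outcome. The names `Step`, `trajectory_reward`, `step_scorer`, and `alpha`, the mean aggregation, and the linear blend are all assumptions made for illustration; the abstract does not specify how the two reward signals are combined, and the toy scorer merely stands in for a learned PRM.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    rationale: str   # why this change is made
    code_delta: str  # incremental RTL added at this step


def trajectory_reward(
    steps: List[Step],
    step_scorer: Callable[[List[Step]], float],
    outcome_passed: bool,
    alpha: float = 0.5,
) -> float:
    """Blend the mean per-step score with a binary outcome reward.

    The linear blend and the weight `alpha` are hypothetical choices.
    """
    if not steps:
        raise ValueError("trajectory must contain at least one step")
    # Score each prefix so a step is judged in the context of earlier steps.
    step_scores = [step_scorer(steps[: i + 1]) for i in range(len(steps))]
    process = sum(step_scores) / len(step_scores)
    outcome = 1.0 if outcome_passed else 0.0
    return alpha * process + (1.0 - alpha) * outcome


def toy_scorer(prefix: List[Step]) -> float:
    """Stand-in for a learned PRM: rewards a step that states a rationale."""
    return 1.0 if prefix[-1].rationale.strip() else 0.0


trajectory = [
    Step("Declare the module interface first.",
         "module counter(input clk, input rst, output reg [3:0] q);"),
    Step("Add a synchronous reset and increment.",
         "always @(posedge clk) q <= rst ? 4'd0 : q + 4'd1;\nendmodule"),
]
reward = trajectory_reward(trajectory, toy_scorer, outcome_passed=True)
```

In this sketch a trajectory whose steps all score well but whose final design fails simulation still receives partial credit, which is one way dense step-level feedback can differ from a purely outcome-based signal.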
