RTL design often relies heavily on ad hoc testbench creation early in the design cycle. While large language models (LLMs) show promise for RTL code generation, their ability to reason about hardware specifications and generate targeted test plans remains largely unexplored. We present the first systematic study of LLM reasoning capabilities for RTL verification stimuli generation, establishing a two-stage framework that decouples test plan generation from testbench execution. Our benchmark reveals that state-of-the-art models, including DeepSeek-R1 and Claude-4.0-Sonnet, achieve only 15.7-21.7% success rates in generating stimuli that pass golden RTL designs. To improve LLM-generated stimuli, we develop a comprehensive training methodology combining supervised fine-tuning with a novel reinforcement learning approach, GRPO with State Mutation (GRPO-SMu), which enhances exploration by varying input mutations. Our approach leverages a tree-based branching mutation strategy to construct training data comprising equivalent and mutated trees, moving beyond linear mutation approaches to provide richer learning signals. Trained on this curated dataset, our 7B-parameter model achieves a 33.3% golden-test pass rate and a 13.9% mutation detection rate, a 17.6% absolute improvement over the baseline that outperforms much larger general-purpose models. These results demonstrate that specialized training methodologies can significantly enhance LLM reasoning capabilities for hardware verification tasks, establishing a foundation for automated sub-unit testing in semiconductor design workflows.