We present HDLFORGE, a two-stage multi-agent framework for automated Verilog generation that optimizes the trade-off between generation speed and accuracy. By default, the system pairs a compact coder with a medium-sized LLM (Stage A) and escalates to a stronger coder backed by an ultra-large LLM (Stage B) only when needed, guided by a calibrated score computed from inexpensive diagnostics: compilation, lint, and smoke tests. A key innovation is a counterexample-guided formal agent that converts bounded-model-checking traces into reusable micro-tests, significantly reducing bug-detection time and repair iterations. The escalation controller is portable and can wrap existing Verilog LLM pipelines without modifying their internals. Evaluated on the VerilogEval Human, VerilogEval V2, and RTLLM benchmarks, with analyses of wall-clock time distributions, escalation thresholds, and agent ablations, HDLFORGE achieves better accuracy-latency trade-offs than single-stage systems. On VerilogEval Human and VerilogEval V2, HDLFORGE-Qwen reaches 91.2% and 91.8% Pass@1, respectively, at roughly 50% lower median latency, substantially outperforming other medium-sized models, and attains 97.2% Pass@5 on RTLLM.
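The escalation logic described above can be sketched as follows. This is a minimal illustrative Python sketch, not the paper's implementation: the `Diagnostics` fields, the scoring weights, and the 0.8 threshold are all assumptions chosen for clarity; in the actual system the score is calibrated on held-out data.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Diagnostics:
    compiled: bool          # did the generated RTL compile?
    lint_warnings: int      # count of lint findings
    smoke_pass_rate: float  # fraction of smoke tests passed, in [0, 1]

def calibrated_score(d: Diagnostics) -> float:
    # Hypothetical weighting: a real controller would fit these
    # weights during calibration rather than hard-code them.
    if not d.compiled:
        return 0.0
    lint_penalty = min(d.lint_warnings * 0.05, 0.3)
    return max(0.0, d.smoke_pass_rate - lint_penalty)

def generate(prompt: str,
             stage_a: Callable[[str], str],
             stage_b: Callable[[str], str],
             diagnose: Callable[[str], Diagnostics],
             threshold: float = 0.8) -> str:
    """Two-stage escalation: try the cheap Stage A pipeline first and
    invoke the expensive Stage B only when the calibrated diagnostic
    score falls below the threshold."""
    rtl = stage_a(prompt)
    if calibrated_score(diagnose(rtl)) >= threshold:
        return rtl
    return stage_b(prompt)
```

Because `generate` takes the two pipelines as opaque callables, the controller can wrap existing Verilog LLM pipelines without touching their internals, matching the portability claim above.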