RTL generation is more than code synthesis: designs must be syntactically valid, synthesizable, functionally correct, and hardware-efficient. State-of-the-art evaluations stop at functional correctness and do not measure synthesis or implementation quality. This paper evaluates 32 language models on 202 Verilog tasks from VerilogEval and RTLLM using the Hardware Quality Index (HQI), which combines post-synthesis area, delay, and warnings relative to expert references in a Nangate45 45\,nm flow. Three performance regimes emerge: 14 frontier models achieve HQI $>$ 66, led by Gemini-3-Pro at 87.5\% coverage and 85.1 HQI; 15 models cluster in the 43--66 HQI range; and 3 fall below 43. The gap between best-of-five capability and single-attempt quality spans 3.7--22.1 HQI points, which limits integration into agentic pipelines. A taxonomy of 195 synthesis failures reveals a systematic divergence: proprietary models fail late, through elaboration errors and synthesis timeouts, whereas open models fail early, often because of missing module wrappers and non-synthesizable constructs, a pattern consistent with training corpora skewed toward simulation rather than synthesis-grade RTL.
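The distinction between best-of-five capability and single-attempt quality can be illustrated with a minimal sketch. It assumes, purely for illustration, that each task has a list of per-attempt quality scores (0 for a failed attempt), that best-of-five takes the maximum per task, and that single-attempt quality is the per-task mean; the function name `capability_gap` and this aggregation are hypothetical and may differ from the paper's actual HQI computation.

```python
from statistics import mean


def capability_gap(attempt_scores):
    """Return (best_of_k, single_attempt, gap) averaged over tasks.

    attempt_scores: one list per task, holding a quality score for each
    attempt (hypothetical convention: 0 marks a failed attempt).
    """
    if not attempt_scores or any(len(task) == 0 for task in attempt_scores):
        raise ValueError("every task needs at least one attempt score")
    # Best-of-k: keep only the strongest attempt on each task.
    best_of_k = mean(max(task) for task in attempt_scores)
    # Single attempt: expected score of one randomly chosen attempt.
    single_attempt = mean(mean(task) for task in attempt_scores)
    return best_of_k, single_attempt, best_of_k - single_attempt


if __name__ == "__main__":
    # Two illustrative tasks with five attempts each (made-up numbers).
    scores = [[80, 60, 0, 70, 90], [50, 50, 50, 50, 50]]
    best, single, gap = capability_gap(scores)
    print(f"best-of-5={best:.1f} single={single:.1f} gap={gap:.1f}")
```

Under these assumptions, a large gap means a model can produce a good design within several tries but does so unreliably, which is the property that matters when a pipeline gets only one attempt per step.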