RTL generation demands more than software code synthesis: designs must be syntactically valid, synthesizable, functionally correct, and hardware-efficient. Existing evaluations stop at functional correctness, leaving synthesizability and implementation quality unmeasured. We evaluate 32 language models on 202 Verilog tasks from VerilogEval and RTLLM, with five attempts each, scoring via the Hardware Quality Index (HQI), a 0--100 metric integrating post-synthesis area, delay, and warning count relative to expert references under a Nangate45 45\,nm flow. Three performance tiers emerge: 13 frontier models achieve a Global HQI above 71, led by Gemini-3-Pro (87.5\% coverage, 85.1 HQI); 11 mid-tier models cluster at 53--68; 8 fall below 53. The capability-to-deployment gap (best-of-five vs.\ single-attempt) spans 3.8--22.1 HQI points, motivating multi-sample strategies. A tool-adjudicated taxonomy of 195 genuine synthesis failures reveals systematic divergence: proprietary models fail late, through elaboration errors and synthesis timeouts; open-weight models fail early, through missing module wrappers and non-synthesizable constructs, consistent with training on simulation-grade rather than synthesis-grade RTL. Rankings hold across three technology libraries with Spearman~$\rho > 0.99$.
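The abstract does not state the exact HQI formula, only that it is a 0--100 composite of post-synthesis area, delay, and warning count measured relative to an expert reference design. The sketch below is purely illustrative of such a ratio-based composite: the geometric-mean aggregation, the clipping at 1.0, and the `+1` smoothing for warning counts are all assumptions, not the paper's definition.

```python
# Illustrative sketch only -- the exact HQI definition is not given in the text.
# Assumed form: each quality dimension becomes a [0, 1] subscore comparing the
# candidate design against the expert reference (1.0 = matches or beats the
# reference), and the subscores are combined by geometric mean, scaled to 0-100.

def subscore(reference: float, candidate: float) -> float:
    """Ratio subscore: 1.0 when the candidate matches or improves on the
    expert reference; shrinks as the candidate's cost (area, delay) grows."""
    return min(reference / candidate, 1.0)

def hqi(area_ref: float, area: float,
        delay_ref: float, delay: float,
        warn_ref: int, warn: int) -> float:
    """Hypothetical 0-100 composite over area, delay, and warning count."""
    s_area = subscore(area_ref, area)
    s_delay = subscore(delay_ref, delay)
    # Warning counts can be zero; add 1 to both sides so the ratio is defined.
    s_warn = subscore(warn_ref + 1, warn + 1)
    return 100.0 * (s_area * s_delay * s_warn) ** (1.0 / 3.0)

# A candidate matching the reference on all three dimensions scores 100.
print(hqi(100.0, 100.0, 2.0, 2.0, 0, 0))  # 100.0
```

Under this assumed form, a design with twice the reference area but matching delay and warnings would score $100 \cdot 0.5^{1/3} \approx 79.4$, so no single dimension can be traded away for free.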