Synthesis-in-the-Loop Evaluation of LLMs for RTL Generation: Quality, Reliability, and Failure Modes

RTL generation is more than code synthesis: designs must be syntactically valid, synthesizable, functionally correct, and hardware-efficient. State-of-the-art evaluations stop at functional correctness and do not measure synthesis or implementation quality. This paper evaluates 32 language models on 202 Verilog tasks from VerilogEval and RTLLM using the Hardware Quality Index (HQI), which combines post-synthesis area, delay, and warnings relative to expert references in a Nangate45 45 nm flow. Three performance regimes emerge: 14 frontier models achieve HQI > 66, led by Gemini-3-Pro at 87.5% coverage and 85.1 HQI; 15 models cluster between 43 and 66 HQI; 3 fall below 43. The gap between best-of-five capability and single-attempt quality spans 3.7 to 22.1 HQI points, which limits integration into agentic pipelines. A taxonomy of 195 synthesis failures reveals a systematic divergence: proprietary models fail late, through elaboration errors and synthesis timeouts; open models fail early, often because of missing module wrappers and non-synthesizable constructs, a pattern consistent with training corpora skewed toward simulation-oriented rather than synthesis-grade RTL.
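The gap between best-of-five capability and single-attempt quality can be illustrated with a minimal sketch. The abstract does not give the HQI formula or the exact aggregation, so the code below rests on labeled assumptions: each task has a list of per-attempt scores on a 0 to 100 scale, a failed attempt scores 0, "capability" is the per-task maximum, and "single-attempt quality" is the per-task mean. The function name `capability_gap` and the sample scores are hypothetical and are not results from the paper.

```python
from statistics import mean


def capability_gap(scores_per_task):
    """Return (best_of_k, single_attempt, gap) averaged over tasks.

    scores_per_task: one inner list of per-attempt scores per task.
    Assumption: best-of-k is the per-task maximum and single-attempt
    quality is the per-task mean; the paper may aggregate differently.
    """
    best_of_k = mean(max(attempts) for attempts in scores_per_task)
    single_attempt = mean(mean(attempts) for attempts in scores_per_task)
    return best_of_k, single_attempt, best_of_k - single_attempt


# Hypothetical per-attempt scores for three tasks, five attempts each.
# A score of 0.0 stands for an attempt that failed to synthesize (assumption).
scores = [
    [80.0, 0.0, 80.0, 60.0, 80.0],
    [100.0, 100.0, 100.0, 100.0, 100.0],
    [0.0, 0.0, 50.0, 0.0, 0.0],
]

best, single, gap = capability_gap(scores)
print(f"best-of-5: {best:.1f}, single-attempt: {single:.1f}, gap: {gap:.1f}")
```

Under these assumptions, a large gap means a model can produce a good design within five tries but does so unreliably, which matters for an agentic pipeline that cannot afford repeated sampling.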

