Large language models (LLMs) have made rapid advancements in code generation for popular languages such as Python and C++. Many of these recent gains can be attributed to the use of ``agents'' that wrap domain-relevant tools alongside LLMs. Hardware design languages such as Verilog have also seen improved code generation in recent years, but the impact of agentic frameworks on Verilog code generation tasks remains unclear. In this work, we present the first systematic evaluation of agentic LLMs for Verilog generation, using the recently introduced CVDP benchmark. We also introduce several open-source hardware design agent harnesses, providing a model-agnostic baseline for future work. Through controlled experiments across frontier models, we study how structured prompting and tool design affect performance, analyze agent failure modes and tool usage patterns, compare open-source and closed-source models, and provide qualitative examples of successful and failed agent runs. Our results show that naively wrapping frontier models in an agent can degrade performance relative to standard forward passes with optimized prompts, but that structured harnesses can match and in some cases exceed non-agentic baselines. We find that the performance gap between open-source and closed-source models is driven by both higher crash rates and weaker interpretation of tool output. Our exploration illuminates a path toward designing special-purpose agents for Verilog generation.