While Large Language Models (LLMs) show significant potential in hardware engineering, current benchmarks suffer from saturation and limited task diversity, and thus fail to reflect LLM performance in real industrial workflows. To address this gap, we propose a comprehensive benchmark for AI-aided chip design that rigorously evaluates LLMs across three critical tasks: Verilog generation, debugging, and reference model generation. Our benchmark comprises 44 realistic modules with complex hierarchical structures, 89 systematic debugging cases, and 132 reference model samples spanning Python, SystemC, and CXXRTL. Evaluation reveals substantial performance gaps: the state-of-the-art Claude-4.5-opus achieves only a 30.74\% pass rate on Verilog generation and 13.33\% on Python reference model generation, a stark contrast to existing saturated benchmarks on which SOTA models exceed 95\% pass rates. Additionally, to help enhance LLM reference model generation, we provide an automated toolbox for generating high-quality training data, facilitating future research in this underexplored domain. Our code is available at https://github.com/zhongkaiyu/ChipBench.git.