LLM-based agents are increasingly applied to the "last mile" of Electronic Design Automation (EDA): repairing residual sign-off Design Rule Check (DRC) violations and converging on Power-Performance-Area (PPA) targets after tool runs. Existing EDA-LLM benchmarks, however, omit DRC fixing entirely and rely on flat hierarchies tied to a single toolchain. We introduce PostEDA-Bench, a hierarchical benchmark of 145 tasks across four categories (DRC-Essential, DRC-Reasoning, PPA-Mono, and PPA-Multi), supported by EDA toolchains with machine-checkable evaluation. We evaluate eight commercial and open-source LLMs under multiple agent scaffolds and report three findings. First, agents handle synthetic DRC-Essential and single-objective PPA-Mono reasonably well but degrade sharply on the more practical DRC-Reasoning, where the best success rate is 36.66%, and PPA-Multi, where the best success rate is 20.00%. Second, vision augmentation consistently improves results on DRC-Bench. Third, trade-off reasoning, rather than knob knowledge, is the dominant PPA-Multi bottleneck.