LLM-based agents are increasingly applied to the "last mile" of Electronic Design Automation (EDA): repairing residual sign-off Design Rule Check (DRC) violations and converging on Power-Performance-Area (PPA) targets after tool runs. Existing EDA-LLM benchmarks, however, omit DRC fixing entirely and rely on flat hierarchies tied to a single toolchain. We introduce PostEDA-Bench, a hierarchical benchmark of 145 tasks across four categories (DRC-Essential, DRC-Reasoning, PPA-Mono, and PPA-Multi), backed by EDA toolchains that provide machine-checkable evaluation. Evaluating eight commercial and open-source LLMs under multiple agent scaffolds, we report three findings. First, agents handle the synthetic DRC-Essential and single-objective PPA-Mono tasks reasonably well but degrade sharply on the more practical DRC-Reasoning and PPA-Multi tasks, where the best success rates are 36.66% and 20.00%, respectively. Second, vision augmentation consistently improves performance on DRC-Bench. Third, trade-off reasoning, rather than knob knowledge, is the dominant bottleneck on PPA-Multi.