Is Agentic AI Ready for Real-World Hardware Engineering? A Deep Dive with Phoenix-bench

We ask whether agentic AI systems built for software engineering transfer to realistic hardware engineering. Existing hardware LLM benchmarks isolate sub-tasks, but none jointly requires repository navigation, hierarchy-aware localization, Electronic Design Automation (EDA) executable verification, and maintenance-style patching. We introduce \textbf{Phoenix-bench}, a synchronized corpus of 511 verified Verilator instances from 114 GitHub repositories, each shipped with the developer patch, design-flow labels, fail-to-pass and pass-to-pass testbenches, and a Docker-pinned EDA environment, so that resolved-rate differences reflect agent behavior rather than toolchain availability. Using Phoenix-bench, we run a uniform evaluation of four commercial agents and eight open-source agentic structures across four LLM backbones, plus two diagnostic interventions (file-level oracle localization and one round of testbench-log feedback). Three findings emerge. (i)~Software and hardware are fundamentally different engineering tasks: the same agent loses 37\% to 58\% in resolved rate when moving from SWE-bench Verified to Phoenix-bench, because hardware bugs propagate across parallel instantiated modules through signal flow rather than along a software-style call graph, and software-tuned agents stop at the symptom file instead of tracing back through the instantiation chain. (ii)~Failures concentrate on design control-flow and finite state machine (FSM) bugs, verification testbench bugs, and hard cases that demand cross-hierarchy signal-flow tracking and coordinated multi-file edits. (iii)~Localization granularity matters far more than localization itself: a perfect file-level oracle yields only $+1.4$\% because the agent then breaks files that did not need editing, while a single round of testbench-log feedback lifts resolved rate by $42$\% to $45$\% because the log reveals both \emph{where} the bug is and \emph{what} the fix has to look like.
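
To make the resolved-rate metric concrete, the following is a minimal sketch. It assumes the SWE-bench-style convention that an instance counts as resolved only when every fail-to-pass testbench passes after the agent's patch and every pass-to-pass testbench keeps passing; the abstract does not spell out this criterion. The names `InstanceResult`, `is_resolved`, and `resolved_rate`, as well as the sample data, are hypothetical and are not part of Phoenix-bench.

```python
from dataclasses import dataclass, field


@dataclass
class InstanceResult:
    """Hypothetical record of testbench outcomes for one benchmark instance."""
    instance_id: str
    # Testbench name -> True if it passed after applying the agent's patch.
    fail_to_pass: dict[str, bool] = field(default_factory=dict)
    pass_to_pass: dict[str, bool] = field(default_factory=dict)


def is_resolved(result: InstanceResult) -> bool:
    """Assumed criterion: every fail-to-pass testbench now passes
    and no pass-to-pass testbench has regressed."""
    if not result.fail_to_pass:
        # Without a fail-to-pass testbench there is nothing to show the bug was fixed.
        return False
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolved_rate(results: list[InstanceResult]) -> float:
    """Percentage of instances that satisfy the assumed resolved criterion."""
    if not results:
        return 0.0
    return 100.0 * sum(is_resolved(r) for r in results) / len(results)


# Illustrative, made-up outcomes (not results from the paper).
results = [
    InstanceResult("a", {"tb_fix": True}, {"tb_regress": True}),   # resolved
    InstanceResult("b", {"tb_fix": False}, {"tb_regress": True}),  # bug not fixed
    InstanceResult("c", {"tb_fix": True}, {"tb_regress": False}),  # fix broke existing behavior
    InstanceResult("d", {"tb_fix": True}, {}),                     # resolved, no pass-to-pass testbenches
]
print(resolved_rate(results))  # 50.0
```

Under this assumed criterion, instance `c` is not resolved even though its target bug is fixed, which is one way an agent that edits files that did not need editing could fail to improve its score.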
