ChipCraftBrain: Validation-First RTL Generation via Multi-Agent Orchestration

Large Language Models (LLMs) show promise for generating Register-Transfer Level (RTL) code from natural language specifications, but single-shot generation achieves only 60-65% functional correctness on standard benchmarks. Multi-agent approaches such as MAGE reach 95.9% on VerilogEval yet remain untested on harder industrial benchmarks such as NVIDIA's CVDP, lack synthesis awareness, and incur high API costs. We present ChipCraftBrain, a framework combining symbolic-neural reasoning with adaptive multi-agent orchestration for automated RTL generation. Four innovations drive the system: (1) adaptive orchestration over six specialized agents via a PPO policy over a 168-dim state (an alternative world-model MPC planner is also evaluated); (2) a hybrid symbolic-neural architecture that solves K-map and truth-table problems algorithmically while specialized agents handle waveform timing and general RTL; (3) knowledge-augmented generation from a 321-pattern base plus 971 open-source reference implementations with focus-aware retrieval; and (4) hierarchical specification decomposition into dependency-ordered sub-modules with interface synchronization. On VerilogEval-Human, ChipCraftBrain achieves 97.2% mean pass@1 (range 96.15-98.72% across 7 runs, best 154/156), on par with ChipAgents (97.4%, self-reported) and ahead of MAGE (95.9%). On a 302-problem non-agentic subset of CVDP spanning five task categories, we reach 94.7% mean pass@1 (286/302, averaged over 3 runs), a 36-60 percentage-point lift per category over the published single-shot baseline; we additionally lead three of four categories shared with NVIDIA's ACE-RTL despite using roughly 30x fewer per-problem attempts. A RISC-V SoC case study demonstrates hierarchical decomposition generating 8/8 lint-passing modules (689 LOC) validated on FPGA, where monolithic generation fails entirely.
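As an illustration of what solving a truth-table problem "algorithmically" can look like, the following is a minimal sketch that turns a list of minterms into a Verilog continuous assignment in canonical sum-of-products form. It is not ChipCraftBrain's solver: the function name `truth_table_to_verilog`, its parameters, and the choice of an unminimized canonical form are assumptions made for illustration, and a real K-map solver would also minimize the expression.

```python
from typing import Sequence


def truth_table_to_verilog(
    inputs: Sequence[str], minterms: Sequence[int], output: str = "f"
) -> str:
    """Build a canonical sum-of-products Verilog assign (hypothetical helper).

    inputs[0] is treated as the most significant bit of each minterm index.
    No logic minimization is performed.
    """
    n = len(inputs)
    unique = sorted(set(minterms))
    for m in unique:
        if not 0 <= m < 2 ** n:
            raise ValueError(f"minterm {m} out of range for {n} inputs")

    # Constant functions need no product terms.
    if not unique:
        return f"assign {output} = 1'b0;"
    if len(unique) == 2 ** n:
        return f"assign {output} = 1'b1;"

    terms = []
    for m in unique:
        literals = []
        for i, name in enumerate(inputs):
            bit = (m >> (n - 1 - i)) & 1
            # A 1 bit keeps the input as-is; a 0 bit complements it.
            literals.append(name if bit else f"~{name}")
        terms.append("(" + " & ".join(literals) + ")")
    return f"assign {output} = " + " | ".join(terms) + ";"


# Two-input XOR: true for (a=0, b=1) and (a=1, b=0), i.e. minterms 1 and 2.
xor_assign = truth_table_to_verilog(["a", "b"], [1, 2])
print(xor_assign)
```

Because the output is derived directly from the table, it is correct by construction for the given minterms, which is the property that makes a symbolic path attractive for this problem class compared with free-form generation.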
