Agent Factories for High Level Synthesis: How Far Can General-Purpose Coding Agents Go in Hardware Optimization?

We present an empirical study of how far general-purpose coding agents -- without hardware-specific training -- can optimize hardware designs from high-level algorithmic specifications. We introduce an agent factory, a two-stage pipeline that constructs and coordinates multiple autonomous optimization agents. In Stage~1, the pipeline decomposes a design into sub-kernels, independently optimizes each using pragma and code-level transformations, and formulates an Integer Linear Program (ILP) to assemble globally promising configurations under an area constraint. In Stage~2, it launches $N$ expert agents over the top ILP solutions, each exploring cross-function optimizations such as pragma recombination, loop fusion, and memory restructuring that are not captured by sub-kernel decomposition. We evaluate the approach on 12 kernels from HLS-Eval and Rodinia-HLS using Claude Code (Opus~4.5/4.6) with AMD Vitis HLS. Scaling from 1 to 10 agents yields a mean $8.27\times$ speedup over baseline, with larger gains on harder benchmarks: streamcluster exceeds $20\times$ and kmeans reaches approximately $10\times$. Across benchmarks, agents consistently rediscover known hardware optimization patterns without domain-specific training, and the best designs often do not originate from top-ranked ILP candidates, indicating that global optimization exposes improvements missed by sub-kernel search. These results establish agent scaling as a practical and effective axis for HLS optimization.
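The Stage 1 assembly step can be illustrated with a minimal sketch: choose exactly one optimized variant per sub-kernel so that total area stays within a budget, then rank the feasible combinations by latency and keep the top few as Stage 2 starting points. The abstract does not give the actual ILP formulation, so everything here is an assumption: the sub-kernel names, the latency and area numbers, the additive latency objective, and the function `top_k_configs` are hypothetical, and exhaustive enumeration stands in for an ILP solver.

```python
from itertools import product

# Hypothetical per-sub-kernel candidates: (label, latency_cycles, area_units).
# All names and numbers are invented for illustration, not taken from the paper.
CANDIDATES = {
    "load":    [("base", 100, 10), ("pipelined", 40, 25)],
    "compute": [("base", 400, 20), ("unroll4", 120, 60), ("unroll8", 70, 110)],
    "store":   [("base", 100, 10), ("pipelined", 50, 20)],
}


def top_k_configs(candidates, area_budget, k):
    """Return the k lowest-latency combinations that fit the area budget.

    Assumes total latency and total area are sums over sub-kernels, which
    is a simplification; brute-force enumeration replaces the ILP solver.
    """
    names = list(candidates)
    feasible = []
    # Pick exactly one variant per sub-kernel.
    for combo in product(*(candidates[name] for name in names)):
        area = sum(variant[2] for variant in combo)
        if area > area_budget:
            continue  # violates the area constraint
        latency = sum(variant[1] for variant in combo)
        choice = {name: variant[0] for name, variant in zip(names, combo)}
        feasible.append((latency, area, choice))
    # Rank by latency, breaking ties in favor of smaller area.
    feasible.sort(key=lambda entry: (entry[0], entry[1]))
    return feasible[:k]


TOP = top_k_configs(CANDIDATES, area_budget=120, k=3)
for latency, area, choice in TOP:
    print(latency, area, choice)
```

In this toy instance the fastest `compute` variant never appears in a feasible combination because its area alone nearly exhausts the budget. Returning several ranked candidates rather than only the optimum mirrors the abstract's observation that the best final designs often do not originate from the top-ranked ILP candidate.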
