Fine-grain clock gating (FGCG) is among the most effective techniques for reducing dynamic power, yet current FGCG optimization flows remain largely manual. Recent LLM-based RTL optimization approaches have two key drawbacks: (1) they cannot process long waveform traces spanning millions of cycles, and (2) they are difficult to scale to large hierarchical codebases while preserving correctness. In this work, we present AUTOGATE, the first agentic framework for industry-grade RTL power optimization, which enables workload-aware clock-gating optimization across large hierarchical codebases. AUTOGATE introduces a machine learning (ML) and LLM co-design that bridges waveform-level analysis and RTL rewriting. Specifically, we design an ML-based clustering algorithm that distills raw toggling traces into compact, structured representations, which then guide LLM-based RTL rewriting. This allows clock-gating opportunities to be identified and applied accurately without requiring LLMs to process raw waveform data directly. To improve scalability, AUTOGATE employs a hierarchical multi-agent architecture that decomposes large designs into independently optimizable modules and coordinates optimization across deep design hierarchies. We evaluate AUTOGATE on a diverse set of designs ranging from small RTL designs to large industrial-grade codebases. Experimental results show that AUTOGATE consistently reduces dynamic power relative to baselines. Across the small-design suite, AUTOGATE reduces dynamic power by 49.31% on average. On industry-scale designs, it achieves dynamic power reductions of 19.34% on NVDLA and 7.96% on BlackParrot, and up to 6.86% on highly optimized proprietary production designs.
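To make the idea of distilling toggling traces into a compact representation concrete, the following Python sketch summarizes per-register toggle traces by windowed activity and groups registers that are idle in the same windows. It is not AUTOGATE's clustering algorithm, which the abstract does not specify. Grouping by identical windowed signatures is a deliberately simplified stand-in, and the function names (`window_signature`, `summarize_traces`), the window size, and the register names are all hypothetical.

```python
from collections import defaultdict


def window_signature(toggles, window):
    """Return one 0/1 flag per window: 1 if the signal toggled in that window."""
    return tuple(
        int(any(toggles[i:i + window]))
        for i in range(0, len(toggles), window)
    )


def summarize_traces(traces, window=4):
    """Group signals with identical windowed activity (a simplified stand-in
    for clustering) and report the fraction of windows in which each group
    is idle."""
    groups = defaultdict(list)
    for name, toggles in traces.items():
        groups[window_signature(toggles, window)].append(name)

    summary = []
    for signature, names in groups.items():
        idle_fraction = 1.0 - sum(signature) / len(signature)
        summary.append({
            "signals": sorted(names),
            "signature": signature,
            "idle_fraction": idle_fraction,
        })
    # Most idle groups first: these are the likeliest gating candidates.
    summary.sort(key=lambda group: group["idle_fraction"], reverse=True)
    return summary


# Hypothetical per-cycle toggle traces (1 = the register changed value).
traces = {
    "acc_reg":  [0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "mul_reg":  [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "ctrl_reg": [1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0],
}
summary = summarize_traces(traces)
for group in summary:
    print(group["signals"], group["signature"], group["idle_fraction"])
```

In this toy input, `acc_reg` and `mul_reg` share an activity pattern and are idle in three of four windows, so a short summary of that group, rather than the raw per-cycle trace, is what would be handed to a rewriting step. A real flow would parse traces from a waveform dump and use a more tolerant similarity measure than exact signature equality.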