Learning effective netlist representations is fundamentally constrained by the scarcity of labeled datasets, as real designs are protected as Intellectual Property (IP) and costly to annotate. Existing work therefore focuses on small-scale circuits with clean labels, limiting its scalability to realistic designs. Meanwhile, Large Language Models (LLMs) can generate Register-Transfer Level (RTL) code at scale, but their functional incorrectness has hindered their use in circuit analysis. In this work, we make a key observation: even when LLM-generated RTL is functionally imperfect, the synthesized netlists still preserve structural patterns that are strongly indicative of the intended functionality. Building on this insight, we propose a cost-effective data augmentation and training framework that systematically exploits imperfect LLM-generated RTL as training data for netlist representation learning, forming an end-to-end pipeline from automated code generation to downstream tasks. We evaluate on circuit functional understanding tasks, including sub-circuit boundary identification and component classification, across benchmarks of increasing scale, extending the task scope from the operator level to the IP level. The evaluations demonstrate that models trained on our noisy synthetic corpus generalize well to real-world netlists, matching or even surpassing methods trained on scarce high-quality data and effectively breaking the data bottleneck in circuit representation learning.