Despite the growing interest in ML-guided EDA tools from RTL to GDSII, there are no standard datasets or prototypical learning tasks defined for the EDA problem domain. Experience from the computer vision community suggests that such datasets are crucial to spur further progress in ML for EDA. Here we describe our experience curating two large-scale, high-quality datasets for Verilog code generation and logic synthesis. The first, VeriGen, is a dataset of Verilog code collected from GitHub and Verilog textbooks. The second, OpenABC-D, is a large-scale, labeled dataset designed to aid ML for logic synthesis tasks. It consists of 870,000 And-Inverter Graphs (AIGs) produced from 1,500 synthesis runs on a large number of open-source hardware projects. In this paper, we discuss the challenges of curating, maintaining, and growing these datasets. We also touch upon questions of dataset quality and security, and on the use of novel data augmentation tools tailored to the hardware domain.