Large language models (LLMs) are powerful tools for design, offering capabilities for both task automation and design assistance. Recent work has shown considerable potential for integrating LLMs into the chip design process; however, much of this work relies on data that are not publicly available and/or not permissively licensed for use in LLM training and distribution. In this paper, we address this gap by introducing an open-source dataset tailored for OpenROAD, a widely adopted open-source EDA toolchain. The dataset contains over 1000 data points and is structured in two formats: (i) a pairwise set composed of question prompts with prose answers, and (ii) a pairwise set composed of code prompts and their corresponding OpenROAD scripts. By providing this dataset, we aim to facilitate LLM-focused research within the EDA domain. The dataset is available at https://github.com/OpenROAD-Assistant/EDA-Corpus.