While many EDA tasks already involve graph-based data, existing LLM approaches in EDA typically either represent graphs as sequential text or simply ignore graph-structured data that could be beneficial, such as the dataflow graphs of RTL code. Recent studies have found that LLM performance suffers when graphs are represented as sequential text, and that supplying additional graph information significantly boosts performance. To address these challenges, we introduce BRIDGES, a framework designed to incorporate the graph modality into LLMs for EDA tasks. BRIDGES integrates an automated data generation workflow, a method that combines the graph modality with LLMs, and a comprehensive evaluation suite. First, we establish an LLM-driven workflow to generate RTL- and netlist-level data, converting them into dataflow and netlist graphs paired with function descriptions. This workflow yields a large-scale dataset comprising over 500,000 graph instances and more than 1.5 billion tokens. Second, we propose a lightweight cross-modal projector that encodes graph representations into text-compatible prompts, enabling LLMs to effectively utilize graph data without architectural modifications. Experimental results demonstrate 2x to 10x improvements over text-only baselines across multiple tasks, including design retrieval accuracy, type prediction accuracy, and function-description perplexity, with negligible computational overhead (<1% increase in model weights and <30% additional runtime). Even without additional LLM fine-tuning, our results outperform the text-only baselines by a large margin. We plan to release BRIDGES, including the dataset, models, and training flow.
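To make the projector idea concrete, below is a minimal PyTorch sketch of one way such a cross-modal projector could look: a small MLP that maps a pooled graph embedding into a fixed number of "soft tokens" in the LLM's embedding space, which are then prepended to the text prompt. The class name, all dimensions, and the two-layer MLP design are illustrative assumptions, not the BRIDGES implementation, which the abstract does not specify.

```python
import torch
import torch.nn as nn

class GraphToPromptProjector(nn.Module):
    """Minimal sketch of a lightweight cross-modal projector (illustrative,
    not the BRIDGES implementation): it maps a pooled graph embedding to a
    small number of soft tokens in the LLM's embedding space, which are
    prepended to the text prompt so the LLM itself needs no changes."""

    def __init__(self, graph_dim: int = 256, llm_dim: int = 4096,
                 hidden_dim: int = 512, num_tokens: int = 8):
        super().__init__()
        self.num_tokens = num_tokens
        self.llm_dim = llm_dim
        # Two-layer MLP; all dimensions here are assumed values.
        self.proj = nn.Sequential(
            nn.Linear(graph_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, num_tokens * llm_dim),
        )

    def forward(self, graph_emb: torch.Tensor) -> torch.Tensor:
        # graph_emb: (batch, graph_dim), e.g. the pooled output of a
        # graph encoder run on a dataflow or netlist graph.
        out = self.proj(graph_emb)
        # (batch, num_tokens, llm_dim): soft-prompt vectors to concatenate
        # with the embedded text tokens before the LLM forward pass.
        return out.view(-1, self.num_tokens, self.llm_dim)


# Hypothetical usage: prepend graph tokens to the prompt embeddings.
projector = GraphToPromptProjector()
graph_emb = torch.randn(2, 256)        # batch of 2 pooled graph embeddings
soft_prompt = projector(graph_emb)     # shape: (2, 8, 4096)
# llm_inputs = torch.cat([soft_prompt, text_token_embeddings], dim=1)
```

A design in this spirit keeps the added parameter count small relative to the LLM and leaves the LLM weights and architecture untouched, consistent with the reported <1% weight increase.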