Text-to-image (T2I) generation has seen significant growth over the past few years. Despite this, there has been little work on generating diagrams with T2I models. A diagram is a symbolic/schematic representation that explains information using structurally rich and spatially complex visualizations (e.g., a dense combination of related objects, text labels, directional arrows/lines, etc.). Existing state-of-the-art T2I models often fail at diagram generation because they lack fine-grained object layout control when many objects are densely connected via complex relations such as arrows/lines, and also often fail to render comprehensible text labels. To address this gap, we present DiagrammerGPT, a novel two-stage text-to-diagram generation framework leveraging the layout guidance capabilities of LLMs to generate more accurate diagrams. In the first stage, we use LLMs to generate and iteratively refine 'diagram plans' (in a planner-auditor feedback loop). In the second stage, we use a diagram generator, DiagramGLIGEN, and a text label rendering module to generate diagrams (with clear text labels) following the diagram plans. To benchmark the text-to-diagram generation task, we introduce AI2D-Caption, a densely annotated diagram dataset built on top of the AI2D dataset. We show that our DiagrammerGPT framework produces more accurate diagrams, outperforming existing T2I models. We also provide comprehensive analysis, including open-domain diagram generation, multi-platform vector graphic diagram generation, human-in-the-loop editing, and multimodal planner/auditor LLMs.
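The first-stage planner-auditor feedback loop can be sketched as follows. This is only a minimal illustration under stated assumptions, not the DiagrammerGPT implementation: the `DiagramPlan` and `Entity` structures, the `refine_plan` function, the `max_rounds` limit, and the toy planner and auditor (which stand in for LLM calls) are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical plan format: (x, y, width, height), normalized to [0, 1].
Box = Tuple[float, float, float, float]


@dataclass
class Entity:
    name: str
    kind: str  # e.g. "object" or "text_label"
    box: Box


@dataclass
class DiagramPlan:
    entities: List[Entity] = field(default_factory=list)
    # (source name, target name, relation kind such as "arrow" or "line")
    relations: List[Tuple[str, str, str]] = field(default_factory=list)


# In a real system both callables would wrap LLM calls; here they are plain functions.
Planner = Callable[[str, Optional[DiagramPlan], List[str]], DiagramPlan]
Auditor = Callable[[str, DiagramPlan], List[str]]


def refine_plan(prompt: str, planner: Planner, auditor: Auditor,
                max_rounds: int = 3) -> DiagramPlan:
    """Draft a plan, then alternate auditing and revising until no issues remain."""
    plan = planner(prompt, None, [])
    for _ in range(max_rounds):
        feedback = auditor(prompt, plan)
        if not feedback:  # the auditor found nothing to fix
            break
        plan = planner(prompt, plan, feedback)
    return plan


def toy_auditor(prompt: str, plan: DiagramPlan) -> List[str]:
    """Assumed check: flag any entity whose box leaves the unit canvas."""
    issues = []
    for e in plan.entities:
        x, y, w, h = e.box
        if x < 0 or y < 0 or x + w > 1 or y + h > 1:
            issues.append(f"{e.name}: box {e.box} leaves the canvas")
    return issues


def toy_planner(prompt: str, previous: Optional[DiagramPlan],
                feedback: List[str]) -> DiagramPlan:
    """Assumed planner: emits a flawed first draft, then shifts boxes back onto the canvas."""
    if previous is None:
        return DiagramPlan(
            entities=[Entity("sun", "object", (0.75, 0.25, 0.5, 0.25)),
                      Entity("earth", "object", (0.0, 0.25, 0.25, 0.25))],
            relations=[("sun", "earth", "arrow")],
        )
    fixed = []
    for e in previous.entities:
        x, y, w, h = e.box
        x = min(max(x, 0.0), 1.0 - w)
        y = min(max(y, 0.0), 1.0 - h)
        fixed.append(Entity(e.name, e.kind, (x, y, w, h)))
    return DiagramPlan(fixed, list(previous.relations))
```

Calling `refine_plan("the sun lights the earth", toy_planner, toy_auditor)` returns a plan that the toy auditor accepts. In the second stage, a plan of this kind would be passed to the diagram generator and the text label rendering module.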