Text-to-image (T2I) generation has advanced rapidly over the past few years, yet there has been little work on generating diagrams with T2I models. A diagram is a symbolic/schematic representation that explains information using structurally rich and spatially complex visualizations (e.g., a dense combination of related objects, text labels, directional arrows, and connection lines). Existing state-of-the-art T2I models often fail at diagram generation for two reasons: they lack fine-grained object layout control when many objects are densely connected via complex relations such as arrows and lines, and they often fail to render comprehensible text labels. To address this gap, we present DiagrammerGPT, a novel two-stage text-to-diagram generation framework that leverages the layout guidance capabilities of LLMs (e.g., GPT-4) to generate more accurate open-domain, open-platform diagrams. In the first stage, we use LLMs to generate and iteratively refine 'diagram plans' in a planner-auditor feedback loop. A diagram plan describes all the entities (objects and text labels), their relationships (arrows or lines), and their bounding box layouts. In the second stage, we use a diagram generator, DiagramGLIGEN, and a text label rendering module to generate diagrams that follow the diagram plans. To benchmark the text-to-diagram generation task, we introduce AI2D-Caption, a densely annotated diagram dataset built on top of the AI2D dataset. We show quantitatively and qualitatively that our DiagrammerGPT framework produces more accurate diagrams than existing T2I models. We also provide a comprehensive analysis covering open-domain diagram generation, vector graphic diagram generation on different platforms, human-in-the-loop diagram plan editing, and multimodal planner/auditor LLMs (e.g., GPT-4Vision). We hope our work can inspire further research on diagram generation via T2I models and LLMs.
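
To make the first stage concrete, the following is a minimal sketch of what a diagram plan and a planner-auditor loop might look like. It is an illustration under stated assumptions, not the DiagrammerGPT implementation: the schema (`Entity`, `Relationship`, `DiagramPlan`), the normalized `(x0, y0, x1, y1)` box format, and the `audit`, `toy_planner`, and `refine` helpers are all hypothetical names introduced here. In the framework described above both the planner and the auditor are LLMs; here they are replaced by simple rule-based stand-ins so the loop can run offline.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical schema: the abstract only states that a plan holds entities,
# relationships, and bounding-box layouts. All field names are assumptions.
BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized to [0, 1]


@dataclass
class Entity:
    name: str
    kind: str  # "object" or "text_label"
    bbox: BBox


@dataclass
class Relationship:
    source: str
    target: str
    kind: str  # "arrow" or "line"


@dataclass
class DiagramPlan:
    entities: List[Entity] = field(default_factory=list)
    relationships: List[Relationship] = field(default_factory=list)


def audit(plan: DiagramPlan) -> List[str]:
    """Rule-based stand-in for the auditor LLM: return a list of problems."""
    issues = []
    names = {e.name for e in plan.entities}
    for e in plan.entities:
        x0, y0, x1, y1 = e.bbox
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            issues.append(f"bbox of '{e.name}' is off-canvas or degenerate")
    for r in plan.relationships:
        for endpoint in (r.source, r.target):
            if endpoint not in names:
                issues.append(f"{r.kind} refers to unknown entity '{endpoint}'")
    return issues


def toy_planner(plan: DiagramPlan, issues: List[str]) -> DiagramPlan:
    """Toy stand-in for the planner LLM: clamp boxes, drop dangling relations."""
    def clamp(v: float) -> float:
        return min(max(v, 0.0), 1.0)

    entities = [
        Entity(e.name, e.kind, (clamp(e.bbox[0]), clamp(e.bbox[1]),
                                clamp(e.bbox[2]), clamp(e.bbox[3])))
        for e in plan.entities
    ]
    names = {e.name for e in entities}
    relationships = [r for r in plan.relationships
                     if r.source in names and r.target in names]
    return DiagramPlan(entities, relationships)


def refine(plan: DiagramPlan,
           planner: Callable[[DiagramPlan, List[str]], DiagramPlan],
           auditor: Callable[[DiagramPlan], List[str]] = audit,
           max_rounds: int = 3) -> DiagramPlan:
    """Alternate auditing and re-planning until no issues remain."""
    for _ in range(max_rounds):
        issues = auditor(plan)
        if not issues:
            break
        plan = planner(plan, issues)
    return plan


# Usage: a draft plan with one off-canvas box and one dangling relationship.
draft = DiagramPlan(
    entities=[
        Entity("sun", "object", (0.1, 0.1, 0.3, 0.3)),
        Entity("earth", "object", (0.8, 0.1, 1.2, 0.4)),
        Entity("sun label", "text_label", (0.1, 0.32, 0.3, 0.4)),
    ],
    relationships=[
        Relationship("sun", "earth", "arrow"),
        Relationship("earth", "moon", "line"),
    ],
)
final_plan = refine(draft, toy_planner)
```

Keeping the auditor and planner as plain callables means an LLM-backed version could be swapped in without changing `refine`. The second stage (layout-guided image generation and text label rendering) would then consume `final_plan`; it is not sketched here.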