DiagrammerGPT: Generating Open-Domain, Open-Platform Diagrams via LLM Planning

Text-to-image (T2I) generation has seen significant growth over the past few years. Despite this, there has been little work on generating diagrams with T2I models. A diagram is a symbolic/schematic representation that explains information using structurally rich and spatially complex visualizations (e.g., a dense combination of related objects, text labels, directional arrows, connection lines, etc.). Existing state-of-the-art T2I models often fail at diagram generation because they lack fine-grained object layout control when many objects are densely connected via complex relations such as arrows/lines and also often fail to render comprehensible text labels. To address this gap, we present DiagrammerGPT, a novel two-stage text-to-diagram generation framework that leverages the layout guidance capabilities of LLMs (e.g., GPT-4) to generate more accurate open-domain, open-platform diagrams. In the first stage, we use LLMs to generate and iteratively refine 'diagram plans' (in a planner-auditor feedback loop) which describe all the entities (objects and text labels), their relationships (arrows or lines), and their bounding box layouts. In the second stage, we use a diagram generator, DiagramGLIGEN, and a text label rendering module to generate diagrams following the diagram plans. To benchmark the text-to-diagram generation task, we introduce AI2D-Caption, a densely annotated diagram dataset built on top of the AI2D dataset. We show quantitatively and qualitatively that our DiagrammerGPT framework produces more accurate diagrams, outperforming existing T2I models. We also provide comprehensive analysis including open-domain diagram generation, vector graphic diagram generation in different platforms, human-in-the-loop diagram plan editing, and multimodal planner/auditor LLMs (e.g., GPT-4Vision). We hope our work can inspire further research on diagram generation via T2I models and LLMs.
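
The first stage described above can be sketched in code. The abstract does not give a plan schema or an auditing procedure, so this is only a minimal illustration under stated assumptions, not the authors' implementation. Every name in it (Entity, Relationship, DiagramPlan, audit_plan, clamp_planner, refine) is hypothetical, and simple rule-based functions stand in for the planner and auditor LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical schema: the abstract only says a plan holds entities,
# relationships, and bounding box layouts. Coordinates are assumed
# to be normalized to [0, 1] as (x0, y0, x1, y1).
BBox = Tuple[float, float, float, float]


@dataclass
class Entity:
    name: str
    kind: str  # "object" or "text_label"
    bbox: BBox


@dataclass
class Relationship:
    source: str
    target: str
    kind: str  # "arrow" or "line"


@dataclass
class DiagramPlan:
    entities: List[Entity] = field(default_factory=list)
    relationships: List[Relationship] = field(default_factory=list)


def audit_plan(plan: DiagramPlan) -> List[str]:
    """Stand-in auditor: rule-based checks in place of an LLM call."""
    issues = []
    names = {e.name for e in plan.entities}
    for e in plan.entities:
        x0, y0, x1, y1 = e.bbox
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            issues.append(f"bbox out of bounds: {e.name}")
    for r in plan.relationships:
        if r.source not in names or r.target not in names:
            issues.append(f"dangling relationship: {r.source}->{r.target}")
    return issues


def clamp_planner(plan: DiagramPlan, issues: List[str]) -> DiagramPlan:
    """Stand-in planner: clamps boxes to the canvas, drops dangling links."""
    def clamp(v: float) -> float:
        return min(1.0, max(0.0, v))

    entities = [
        Entity(e.name, e.kind, tuple(clamp(v) for v in e.bbox))
        for e in plan.entities
    ]
    names = {e.name for e in entities}
    rels = [
        r for r in plan.relationships
        if r.source in names and r.target in names
    ]
    return DiagramPlan(entities, rels)


def refine(
    plan: DiagramPlan,
    planner: Callable[[DiagramPlan, List[str]], DiagramPlan],
    auditor: Callable[[DiagramPlan], List[str]],
    max_rounds: int = 3,
) -> DiagramPlan:
    """Feedback loop: audit the plan, then revise it until no issues remain."""
    for _ in range(max_rounds):
        issues = auditor(plan)
        if not issues:
            break
        plan = planner(plan, issues)
    return plan


draft = DiagramPlan(
    entities=[
        Entity("sun", "object", (0.1, 0.1, 0.3, 0.3)),
        Entity("earth", "object", (0.6, 0.4, 1.2, 0.8)),  # spills off canvas
    ],
    relationships=[
        Relationship("sun", "earth", "arrow"),
        Relationship("sun", "moon", "line"),  # "moon" is not an entity
    ],
)
final = refine(draft, clamp_planner, audit_plan)
```

In this sketch the refined plan is what a second stage would consume: a layout-conditioned generator would draw the object boxes, and a separate rendering step would place the text labels. Capping the loop with max_rounds is a design choice of the sketch, so that a plan the planner cannot repair does not loop forever.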

