DiagrammerGPT: Generating Open-Domain, Open-Platform Diagrams via LLM Planning

Text-to-image (T2I) generation has seen significant growth over the past few years. Despite this, there has been little work on generating diagrams with T2I models. A diagram is a symbolic/schematic representation that explains information using structurally rich and spatially complex visualizations (e.g., a dense combination of related objects, text labels, directional arrows/lines, etc.). Existing state-of-the-art T2I models often fail at diagram generation because they lack fine-grained object layout control when many objects are densely connected via complex relations such as arrows/lines, and also often fail to render comprehensible text labels. To address this gap, we present DiagrammerGPT, a novel two-stage text-to-diagram generation framework leveraging the layout guidance capabilities of LLMs to generate more accurate diagrams. In the first stage, we use LLMs to generate and iteratively refine 'diagram plans' (in a planner-auditor feedback loop). In the second stage, we use a diagram generator, DiagramGLIGEN, and a text label rendering module to generate diagrams (with clear text labels) following the diagram plans. To benchmark the text-to-diagram generation task, we introduce AI2D-Caption, a densely annotated diagram dataset built on top of the AI2D dataset. We show that our DiagrammerGPT framework produces more accurate diagrams, outperforming existing T2I models. We also provide comprehensive analysis, including open-domain diagram generation, multi-platform vector graphic diagram generation, human-in-the-loop editing, and multimodal planner/auditor LLMs.
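The planner-auditor feedback loop of the first stage can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the plan format or the LLM prompts, so the `DiagramPlan` fields, the `refine_plan` function, the `max_rounds` cap, and the toy planner and auditor below are all hypothetical stand-ins. The second stage (DiagramGLIGEN and the text label rendering module) is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class DiagramPlan:
    # Hypothetical plan format; the abstract does not define these fields.
    entities: List[str] = field(default_factory=list)
    labels: List[str] = field(default_factory=list)
    arrows: List[Tuple[str, str]] = field(default_factory=list)


# Assumed interfaces: in the paper both roles are played by LLMs.
Planner = Callable[[str, List[str]], DiagramPlan]  # (prompt, feedback) -> plan
Auditor = Callable[[str, DiagramPlan], List[str]]  # (prompt, plan) -> issues


def refine_plan(prompt: str, planner: Planner, auditor: Auditor,
                max_rounds: int = 3) -> DiagramPlan:
    """Draft a plan, then let the auditor request fixes until it has none."""
    plan = planner(prompt, [])
    for _ in range(max_rounds):
        feedback = auditor(prompt, plan)
        if not feedback:  # auditor found nothing to fix
            break
        plan = planner(prompt, feedback)
    return plan


def toy_planner(prompt: str, feedback: List[str]) -> DiagramPlan:
    # Stand-in for an LLM call: the first draft forgets one text label,
    # and the planner adds any label the auditor reports as missing.
    plan = DiagramPlan(entities=["sun", "plant"],
                       labels=["sun"],
                       arrows=[("sun", "plant")])
    prefix = "missing label: "
    for issue in feedback:
        if issue.startswith(prefix):
            plan.labels.append(issue[len(prefix):])
    return plan


def toy_auditor(prompt: str, plan: DiagramPlan) -> List[str]:
    # Stand-in for an LLM call: flag every entity without a text label.
    return [f"missing label: {e}" for e in plan.entities
            if e not in plan.labels]


if __name__ == "__main__":
    final = refine_plan("a diagram of photosynthesis", toy_planner, toy_auditor)
    print(final)
```

In this toy run the first draft omits the label for "plant", the auditor reports it, and the second draft passes the audit. Capping the loop with `max_rounds` is a design choice of the sketch, made so that a planner that never satisfies the auditor cannot loop forever.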

