Generating bitmap graphics from text has gained considerable attention, yet for scientific figures, vector graphics are often preferred. Given that vector graphics are typically encoded using low-level graphics primitives, generating them directly is difficult. To address this, we propose the use of TikZ, a well-known abstract graphics language that can be compiled to vector graphics, as an intermediate representation of scientific figures. TikZ offers human-oriented, high-level commands, thereby facilitating conditional language modeling with any large language model. To this end, we introduce DaTikZ, the first large-scale TikZ dataset consisting of 120k TikZ drawings aligned with captions. We fine-tune LLaMA on DaTikZ, as well as our new model CLiMA, which augments LLaMA with multimodal CLIP embeddings. In both human and automatic evaluation, CLiMA and LLaMA outperform commercial GPT-4 and Claude 2 in terms of similarity to human-created figures, with CLiMA additionally improving text-image alignment. Our detailed analysis shows that all models generalize well and are not susceptible to memorization. GPT-4 and Claude 2, however, tend to generate more simplistic figures compared to both humans and our models. We make our framework, AutomaTikZ, along with model weights and datasets, publicly available.
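To illustrate what "human-oriented, high-level commands" means in practice, the following is a minimal sketch of a caption-like description ("a plot of a quadratic curve with labeled axes") expressed in TikZ. It is an illustrative example written for this purpose and is not taken from DaTikZ or produced by any of the models; the coordinates, colors, and function are arbitrary assumptions.

```latex
% Illustrative only: not a sample from DaTikZ or a model output.
\documentclass[tikz]{standalone}
\begin{document}
\begin{tikzpicture}
  % Axes with arrow tips and labels
  \draw[->] (0,0) -- (4,0) node[right] {$x$};
  \draw[->] (0,0) -- (0,3) node[above] {$y$};
  % A quadratic curve drawn with a single high-level plot command
  \draw[thick, blue, domain=0:3.5, samples=50] plot (\x, {0.2*\x*\x});
  % Label for the curve
  \node[blue] at (2.5,2.2) {$y = 0.2x^2$};
\end{tikzpicture}
\end{document}
```

Compiling this file yields a vector graphic. Each command states intent (an axis, a plotted function, a label) rather than individual path segments, which is the property that makes TikZ a plausible target for text-conditioned language modeling.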