Large language models (LLMs) are increasingly used to assist scientists across diverse workflows. A key challenge is generating high-quality figures from textual descriptions, where figures are often represented as TikZ programs that can be rendered into scientific images. Prior research has proposed a variety of datasets and modeling approaches for this task. However, existing datasets for Text-to-TikZ are too small and noisy to capture the complexity of TikZ, causing mismatches between text and rendered figures. Moreover, prior approaches rely solely on supervised fine-tuning (SFT), which does not expose the model to the rendered semantics of the figure, often resulting in errors such as looping, irrelevant content, and incorrect spatial relations. To address these issues, we construct DaTikZ-V4, a dataset more than four times larger and substantially higher in quality than DaTikZ-V3, enriched with LLM-generated figure descriptions. Using this dataset, we train TikZilla, a family of small open-source Qwen models (3B and 8B), with a two-stage pipeline of SFT followed by reinforcement learning (RL). For RL, we leverage an image encoder trained via inverse graphics to provide semantically faithful reward signals. Extensive human evaluations comprising over 1,000 judgments show that TikZilla improves by 1.5-2 points over its base models on a 5-point scale, surpasses GPT-4o by 0.5 points, and matches GPT-5 in the image-based evaluation, while operating at much smaller model sizes. Code, data, and models will be made available.