Large language models (LLMs) are increasingly used to assist scientists across diverse workflows. A key challenge is generating high-quality figures from textual descriptions, where figures are often represented as TikZ programs that can be rendered into scientific images. Prior research has proposed a variety of datasets and modeling approaches for this task. However, existing datasets for Text-to-TikZ are too small and noisy to capture the complexity of TikZ, causing mismatches between text and rendered figures. Moreover, prior approaches rely solely on supervised fine-tuning (SFT), which does not expose the model to the rendered semantics of the figure, often resulting in errors such as looping, irrelevant content, and incorrect spatial relations. To address these issues, we construct DaTikZ-V4, a dataset more than four times larger and substantially higher in quality than DaTikZ-V3, enriched with LLM-generated figure descriptions. Using this dataset, we train TikZilla, a family of small open-source Qwen models (3B and 8B), with a two-stage pipeline of SFT followed by reinforcement learning (RL). For RL, we leverage an image encoder trained via inverse graphics to provide semantically faithful reward signals. Extensive human evaluations comprising over 1,000 judgments show that TikZilla improves by 1.5-2 points over its base models on a 5-point scale, surpasses GPT-4o by 0.5 points, and matches GPT-5 in the image-based evaluation, while operating at much smaller model sizes. Code, data, and models will be made available.