Creating high-quality scientific figures can be time-consuming and challenging, even though sketching ideas on paper is relatively easy. Furthermore, recreating existing figures that are not stored in formats preserving semantic information is equally complex. To tackle this problem, we introduce DeTikZify, a novel multimodal language model that automatically synthesizes scientific figures as semantics-preserving TikZ graphics programs based on sketches and existing figures. To achieve this, we create three new datasets: DaTikZv2, the largest TikZ dataset to date, containing over 360k human-created TikZ graphics; SketchFig, a dataset that pairs hand-drawn sketches with their corresponding scientific figures; and MetaFig, a collection of diverse scientific figures and associated metadata. We train DeTikZify on MetaFig and DaTikZv2, along with synthetically generated sketches learned from SketchFig. We also introduce an MCTS-based inference algorithm that enables DeTikZify to iteratively refine its outputs without the need for additional training. Through both automatic and human evaluation, we demonstrate that DeTikZify outperforms the commercial models Claude 3 and GPT-4V in synthesizing TikZ programs, with the MCTS algorithm effectively boosting its performance. We make our code, models, and datasets publicly available.
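To make the MCTS-based inference idea concrete, the following is a minimal, self-contained sketch of Monte Carlo Tree Search over candidate programs. It is not the paper's implementation: the `expand` callback stands in for sampling continuations or edits from the model, and the `score` callback stands in for the reward signal (e.g., image similarity after compiling a candidate TikZ program); both are hypothetical placeholders supplied by the caller.

```python
import math
import random


class Node:
    """A node in the search tree holding one candidate (partial) program."""

    def __init__(self, program, parent=None):
        self.program = program  # here just a string standing in for TikZ code
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0


def uct(node, c=1.4):
    # Upper Confidence bound for Trees: trade off exploitation vs. exploration.
    if node.visits == 0:
        return float("inf")  # always try unvisited children first
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits
    )


def mcts_refine(root_program, expand, score, iterations=50, seed=0):
    """Iteratively refine a candidate program with MCTS.

    expand(program) -> list of successor programs (placeholder for model
    sampling); score(program) -> reward in [0, 1] (placeholder for
    compile-and-compare feedback). Returns (best reward, best program).
    """
    random.seed(seed)
    root = Node(root_program)
    best = (score(root_program), root_program)
    for _ in range(iterations):
        # Selection: descend by UCT until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=uct)
        # Expansion: attach model-proposed successors to the leaf.
        for cand in expand(node.program):
            node.children.append(Node(cand, parent=node))
        # Simulation: evaluate one child (or the leaf itself if terminal).
        leaf = random.choice(node.children) if node.children else node
        reward = score(leaf.program)
        best = max(best, (reward, leaf.program))
        # Backpropagation: update visit counts and values up to the root.
        while leaf is not None:
            leaf.visits += 1
            leaf.value += reward
            leaf = leaf.parent
    return best
```

Because the search only needs a scoring function, not gradients, this kind of refinement loop can improve outputs at inference time without any additional training, which is the property the abstract highlights.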