Professionals in technical domains often hand-draw technical diagrams (e.g., flowcharts, block diagrams) on whiteboards or paper during discussions; however, editing these diagrams later requires redrawing them from scratch. Modern vision-language models (VLMs) have made tremendous progress in image understanding, but they struggle to understand technical diagrams. One way to overcome this problem is to fine-tune on real-world hand-drawn images, but collecting a large number of such images is impractical. In this paper, we introduce a large synthetically generated corpus, reflective of real-world images, for training VLMs, and subsequently evaluate VLMs on a smaller corpus of hand-drawn images with the help of human annotators. We introduce several new self-supervision tasks for training, perform extensive experiments with various baseline models, and fine-tune the Llama 3.2 11B-instruct model on synthetic images on these tasks to obtain Llama-VL-TUG, which significantly improves the ROUGE-L performance of Llama 3.2 11B-instruct by 2.14x and achieves the best all-round performance across all baseline models. On real-world images, human evaluation reveals that we achieve the fewest compilation errors across all baselines on 7 out of 8 diagram types and improve the average F1 score of Llama 3.2 11B-instruct by 6.97x.