Vision and language models (VLMs) are expected to analyse complex documents, such as those containing flowcharts, through a question-answering (QA) interface. The ability to recognise and interpret these flowcharts is in high demand, as they provide valuable insights unavailable in text-only explanations. However, developing VLMs with precise flowchart understanding requires large-scale datasets of flowchart images and corresponding text, the creation of which is highly time-consuming. To address this challenge, we introduce JSynFlow, a synthesised visual QA dataset for Japanese flowcharts, generated using large language models (LLMs). Our dataset comprises task descriptions for various business occupations, the corresponding flowchart images rendered from domain-specific language (DSL) code, and related QA pairs. This paper details the dataset's synthesis procedure and demonstrates that fine-tuning with JSynFlow significantly improves VLM performance on flowchart-based QA tasks. Our dataset is publicly available at https://huggingface.co/datasets/jri-advtechlab/jsynflow.