Vision and language models (VLMs) are expected to analyse complex documents, such as those containing flowcharts, through a question-answering (QA) interface. The ability to recognise and interpret these flowcharts is in high demand, as they provide valuable insights unavailable in text-only explanations. However, developing VLMs with precise flowchart understanding requires large-scale datasets of flowchart images and corresponding text, the creation of which is highly time-consuming. To address this challenge, we introduce JSynFlow, a synthesised visual QA dataset for Japanese flowcharts, generated using large language models (LLMs). Our dataset comprises task descriptions for various business occupations, the corresponding flowchart images rendered from domain-specific language (DSL) code, and related QA pairs. This paper details the dataset's synthesis procedure and demonstrates that fine-tuning with JSynFlow significantly improves VLM performance on flowchart-based QA tasks. Our dataset is publicly available at https://huggingface.co/datasets/jri-advtechlab/jsynflow.