Visual Question Answering (VQA) is a complex task that requires the capability to process natural language and images simultaneously. Early research on this task focused on developing methods to help machines understand objects and scene context in images. However, scene text, which often carries explicit information essential to fully understanding an image, was largely overlooked. As AI research has advanced, many studies worldwide have examined the reading-comprehension ability of VQA models. Therefore, we introduce the first large-scale Vietnamese dataset specializing in scene-text understanding, named ViTextVQA (\textbf{Vi}etnamese \textbf{Text}-based \textbf{V}isual \textbf{Q}uestion \textbf{A}nswering dataset), which contains \textbf{over 16,000} images and \textbf{over 50,000} questions with answers. To tackle this task efficiently, we propose ViTextBLIP-2, a novel multimodal feature fusion method that optimizes Vietnamese OCR-based VQA by integrating a frozen Vision Transformer, the SwinTextSpotter OCR model, and the ViT5 LLM with a trainable Q-Former for multimodal feature fusion. Through experiments with various state-of-the-art models, we uncover the significance of the order in which tokens in OCR text are processed and selected to formulate answers. This finding helped us significantly improve the performance of the baseline models on the ViTextVQA dataset. Our dataset is available (https://github.com/minhquan6203/ViTextVQA-Dataset) for research purposes.
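The abstract notes that the order in which OCR tokens are fed to the model affects answer quality. As a minimal sketch, assuming detections come with bounding-box coordinates, one plausible ordering (an illustration, not necessarily the paper's exact method) is standard reading order: group tokens into lines top-to-bottom, then sort each line left-to-right before concatenating them into the OCR context. All names here are hypothetical.

```python
def sort_ocr_tokens(detections, line_tolerance=10):
    """Sort OCR detections into reading order.

    detections: list of (text, x, y), where (x, y) is the top-left corner
    of the token's bounding box. Tokens whose y-coordinates differ by less
    than `line_tolerance` pixels are treated as being on the same line.
    """
    # Group tokens into lines by y-coordinate (scan top to bottom).
    rows = []
    for det in sorted(detections, key=lambda d: d[2]):
        if rows and abs(rows[-1][0][2] - det[2]) < line_tolerance:
            rows[-1].append(det)   # same line as the previous token
        else:
            rows.append([det])     # start a new line
    # Within each line, order tokens left-to-right by x-coordinate.
    ordered = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda d: d[1]))
    return [text for text, _, _ in ordered]

# Example: tokens detected out of reading order on a shop sign.
dets = [("PHO", 120, 12), ("QUAN", 10, 10), ("HANOI", 15, 60)]
print(sort_ocr_tokens(dets))  # ['QUAN', 'PHO', 'HANOI']
```

Feeding tokens in a consistent reading order gives the language model a coherent text sequence rather than an arbitrary detection order, which is one way the token-ordering effect reported above could be exploited.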