Visual Question Answering (VQA) is a complex task that requires the capability to process natural language and images simultaneously. Early research on this task focused on developing methods to help machines understand objects and scene context in images. However, scene text, which often carries explicit information essential to fully understanding an image, was largely overlooked. As AI research has advanced, many studies worldwide have examined the reading-comprehension ability of VQA models. Therefore, we introduce the first large-scale Vietnamese dataset specializing in scene-text understanding, named ViTextVQA (\textbf{Vi}etnamese \textbf{Text}-based \textbf{V}isual \textbf{Q}uestion \textbf{A}nswering dataset), which contains \textbf{over 16,000} images and \textbf{over 50,000} questions with answers. To tackle this task efficiently, we propose ViTextBLIP-2, a novel multimodal feature fusion method that optimizes Vietnamese OCR-based VQA by integrating a frozen Vision Transformer, the SwinTextSpotter OCR model, and the ViT5 LLM with a trainable Q-Former for multimodal feature fusion. Through experiments with various state-of-the-art models, we uncover the significance of the order in which tokens in OCR text are processed and selected to formulate answers. This finding helped us significantly improve the performance of the baseline models on the ViTextVQA dataset. Our dataset is available (https://github.com/minhquan6203/ViTextVQA-Dataset) for research purposes.
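The abstract notes that the order in which OCR tokens are fed to the model affects answer quality. As a minimal sketch, assuming detections come with bounding-box coordinates, one plausible ordering (an illustration, not necessarily the paper's exact method) is standard reading order: group tokens into lines top-to-bottom, then sort each line left-to-right before concatenating them into the OCR context. All names here are hypothetical.

```python
def sort_ocr_tokens(detections, line_tolerance=10):
    """Sort OCR detections into reading order.

    detections: list of (text, x, y), where (x, y) is the top-left corner
    of the token's bounding box. Tokens whose y-coordinates differ by less
    than `line_tolerance` pixels are treated as being on the same line.
    """
    # Group tokens into lines by y-coordinate (scan top to bottom).
    rows = []
    for det in sorted(detections, key=lambda d: d[2]):
        if rows and abs(rows[-1][0][2] - det[2]) < line_tolerance:
            rows[-1].append(det)   # same line as the previous token
        else:
            rows.append([det])     # start a new line
    # Within each line, order tokens left-to-right by x-coordinate.
    ordered = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda d: d[1]))
    return [text for text, _, _ in ordered]

# Example: tokens detected out of reading order on a shop sign.
dets = [("PHO", 120, 12), ("QUAN", 10, 10), ("HANOI", 15, 60)]
print(sort_ocr_tokens(dets))  # ['QUAN', 'PHO', 'HANOI']
```

Feeding tokens in a consistent reading order gives the language model a coherent text sequence rather than an arbitrary detection order, which is one way the token-ordering effect reported above could be exploited.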