Visual Question Answering (VQA) is a challenging task that requires the joint understanding of natural language and visual content. While early research primarily focused on recognizing objects and scene context, it often overlooked scene text---an essential source of explicit semantic information. This paper introduces \textbf{ViTextVQA} (\textbf{Vi}etnamese \textbf{Text}-based \textbf{V}isual \textbf{Q}uestion \textbf{A}nswering), the first large-scale Vietnamese dataset specializing in text-based VQA. The dataset contains \textbf{over 16,000} images and \textbf{over 50,000} question-answer pairs. To tackle this task efficiently, we propose \textbf{ViTextBLIP-2} (Vietnamese Text-based Bootstrapped Language-Image Model via Fine-tuning), a novel multimodal feature fusion method designed to optimize Vietnamese text-based VQA. Experiments with state-of-the-art models highlight the importance of token ordering in OCR text for answer generation, leading to significant performance improvements. The ViTextVQA dataset is publicly available for research purposes.