ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model for Visual Question Answering in Vietnamese

In recent years, Visual Question Answering (VQA) has gained significant attention for its diverse applications, including intelligent car assistance, aid for visually impaired individuals, and document image information retrieval using natural language queries. VQA requires effective integration of information from questions and images to generate accurate answers. Neural models for VQA have made remarkable progress on large-scale datasets, but this work has focused primarily on resource-rich languages such as English. To address this gap, we introduce ViCLEVR, a pioneering dataset for evaluating various visual reasoning capabilities in Vietnamese while mitigating biases. The dataset comprises over 26,000 images and 30,000 question-answer pairs (QAs), with each question annotated to specify the type of reasoning involved. Leveraging this dataset, we conduct a comprehensive analysis of contemporary visual reasoning systems, offering insights into their strengths and limitations. Furthermore, we present PhoVIT, a comprehensive multimodal fusion model that identifies objects in images based on questions. The architecture employs transformers to enable simultaneous reasoning over textual and visual data, merging both modalities at an early stage of the model. Experimental results demonstrate that our proposed model achieves state-of-the-art performance across four evaluation metrics. The accompanying code and dataset are publicly available at \url{https://github.com/kvt0012/ViCLEVR}. We hope this release will stimulate advances in the research community and foster the development of multimodal fusion algorithms tailored to the nuances of low-resource languages such as Vietnamese.
