Visual Question Answering (VQA) has recently emerged as a prominent research area, attracting broad interest across artificial intelligence and computer vision. Although approaches for English are plentiful, systems developed specifically for other languages, particularly Vietnamese, remain scarce. This study aims to bridge this gap through comprehensive experiments on the Vietnamese Visual Question Answering (ViVQA) dataset, demonstrating the effectiveness of our proposed model. We develop a model that strengthens image representation, thereby improving overall performance on the ViVQA task. Specifically, our model integrates Bootstrapping Language-Image Pre-training with frozen unimodal models (BLIP-2) and the convolutional neural network EfficientNet to extract and process both local and global image features. This integration leverages the strengths of transformer-based architectures for capturing global contextual information and of convolutional networks for extracting fine-grained local features. By freezing the parameters of these pre-trained models, we substantially reduce computational cost and training time while maintaining high performance, and the resulting image representations strengthen existing VQA systems. We then employ a multi-modal fusion module based on BEiT-3, a general-purpose multi-modal foundation model, to fuse visual and textual features. Experimental results show that our model surpasses competing baselines, achieving an accuracy of $71.04\%$ on the ViVQA test set, a notable advance for Vietnamese VQA. The code is available at https://github.com/nngocson2002/ViVQA.
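The abstract describes the architecture only at a high level; the following minimal PyTorch sketch illustrates one way the named components could fit together. It is not the authors' implementation: the `nn.TransformerEncoder` is a hypothetical stand-in for the BEiT-3 fusion module, the Vietnamese question encoder is assumed to produce token embeddings elsewhere, and the projection dimensions and answer-vocabulary size are illustrative.

```python
# Minimal architectural sketch, assuming PyTorch, torchvision, and
# Hugging Face Transformers. Everything beyond the named backbones
# (BLIP-2 vision tower, EfficientNet, a BEiT-3-style fusion step)
# is an illustrative assumption, not the paper's exact design.
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b0, EfficientNet_B0_Weights
from transformers import Blip2Model


class ViVQASketch(nn.Module):
    def __init__(self, num_answers: int = 1000, d_model: int = 768):
        super().__init__()
        # Global features: BLIP-2's ViT vision tower (a large download).
        blip2 = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b")
        self.vit = blip2.vision_model
        # Local features: EfficientNet's convolutional backbone.
        self.cnn = efficientnet_b0(weights=EfficientNet_B0_Weights.DEFAULT).features
        # Freeze both pre-trained encoders, as the abstract describes,
        # to cut computational cost and training time.
        for p in self.vit.parameters():
            p.requires_grad = False
        for p in self.cnn.parameters():
            p.requires_grad = False
        # Project both feature streams into a shared embedding space.
        self.vit_proj = nn.Linear(self.vit.config.hidden_size, d_model)
        self.cnn_proj = nn.Linear(1280, d_model)  # EfficientNet-B0 output channels
        # Hypothetical stand-in for the BEiT-3 multi-modal fusion module.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=4)
        self.classifier = nn.Linear(d_model, num_answers)

    def forward(self, pixel_values: torch.Tensor, text_emb: torch.Tensor):
        # pixel_values: (B, 3, 224, 224); text_emb: (B, T, d_model),
        # i.e. question tokens already embedded by some text encoder.
        # For simplicity both backbones share one preprocessed image;
        # in practice each would use its own normalization.
        with torch.no_grad():
            g = self.vit(pixel_values).last_hidden_state  # (B, 257, 1408)
            l = self.cnn(pixel_values)                    # (B, 1280, 7, 7)
        g = self.vit_proj(g)                              # (B, 257, d_model)
        l = self.cnn_proj(l.flatten(2).transpose(1, 2))   # (B, 49, d_model)
        # Fuse global, local, and textual tokens in one sequence.
        fused = self.fusion(torch.cat([g, l, text_emb], dim=1))
        return self.classifier(fused.mean(dim=1))         # answer logits
```

Treating the answer space as a fixed classification vocabulary and mean-pooling the fused sequence are common VQA conventions used here only to keep the sketch self-contained.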