Visual Question Answering (VQA) is a challenging task that combines natural language processing (NLP) and computer vision (CV), and it has attracted considerable research interest. English, as a high-resource language, has seen notable advances in both datasets and models for VQA. However, models that target lower-resource languages such as Vietnamese remain scarce. To address this limitation, we introduce BARTPhoBEiT, a transformer-based model for Vietnamese VQA. The model combines a pre-trained Vietnamese sequence-to-sequence model with Bidirectional Encoder representation from Image Transformers (BEiT), and we evaluate it on Vietnamese VQA datasets. Experimental results show that the proposed model outperforms a strong baseline and improves on the state of the art across six metrics: Accuracy, Precision, Recall, F1-score, WUPS 0.0, and WUPS 0.9.