VLSP2022-EVJVQA Challenge: Multilingual Visual Question Answering

Visual Question Answering (VQA) is a challenging task at the intersection of natural language processing (NLP) and computer vision (CV), and it has attracted significant attention from researchers. English is a resource-rich language that has seen extensive development of datasets and models for visual question answering. Resources and models for visual question answering in other languages also need to be developed. In addition, there is no multilingual dataset targeting the visual content of a particular country with its own objects and cultural characteristics. To address these gaps, we provide the research community with a benchmark dataset named EVJVQA, comprising more than 33,000 question-answer pairs in three languages (Vietnamese, English, and Japanese) over approximately 5,000 images taken in Vietnam, for evaluating multilingual VQA systems or models. EVJVQA serves as the benchmark dataset for the multilingual visual question answering challenge at the 9th Workshop on Vietnamese Language and Speech Processing (VLSP 2022). The task attracted 62 participating teams from various universities and organizations. In this article, we present the organization of the challenge, an overview of the methods employed by shared-task participants, and the results. The highest performances on the private test set are 0.4392 in F1-score and 0.4009 in BLEU. The multilingual QA systems proposed by the top two teams use ViT as the pre-trained vision model and mT5, a powerful pre-trained language model based on the transformer architecture, as the language model. EVJVQA is a challenging dataset that motivates NLP and CV researchers to further explore multilingual models and systems for visual question answering. We have released the challenge on the Codalab evaluation system for further research.
