Visual Question Answering (VQA) is a task that requires a computer to answer an input question correctly based on an image. Humans solve this task with ease, but it remains a challenge for computers. The VLSP2022-EVJVQA shared task poses VQA in a multilingual setting on a newly released dataset, UIT-EVJVQA, in which the questions and answers are written in three languages: English, Vietnamese, and Japanese. We approach the challenge as a sequence-to-sequence learning task, integrating hints from pre-trained state-of-the-art VQA models and image features with a Convolutional Sequence-to-Sequence network to generate the desired answers. Our system achieved an F1 score of up to 0.3442 on the public test set and 0.4210 on the private test set, placing 3rd in the competition.
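The abstract reports F1 scores but does not state how F1 is computed. As a rough illustration only, the sketch below assumes a token-overlap F1 between each generated answer and its reference, averaged over all examples. The function names `token_f1` and `mean_f1` are hypothetical, and whitespace tokenization is an assumption that would not suit Japanese text without a proper tokenizer.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers count as a match; one empty answer is a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_f1(predictions, references):
    """Average token-overlap F1 over paired predictions and references."""
    scores = [token_f1(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, `token_f1("a red car", "the red car")` shares two of three tokens on each side, so precision and recall are both 2/3 and the F1 is 2/3.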