Visual Question Answering (VQA) is a task in which a computer must answer a natural-language question about a given image. Humans solve this task with ease, but it remains challenging for computers. The VLSP2022-EVJVQA shared task addresses VQA in a multilingual setting on a newly released dataset, UIT-EVJVQA, in which questions and answers are written in three languages: English, Vietnamese, and Japanese. We approached the challenge as a sequence-to-sequence learning task: we combined hints from pre-trained state-of-the-art VQA models and image features with a Convolutional Sequence-to-Sequence network to generate the desired answers. Our approach achieved an F1 score of up to 0.3442 on the public test set and 0.4210 on the private test set, and placed 3rd in the competition.
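For readers unfamiliar with how an F1 score applies to generated answers, the following is a minimal sketch of a token-overlap F1 averaged over a test set. It assumes a SQuAD-style definition with lowercasing and whitespace tokenization; the function names `token_f1` and `mean_f1` are illustrative, and the shared task's official scorer may tokenize differently (Japanese in particular is not whitespace-delimited).

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between one predicted and one reference answer.

    Assumption: tokens are lowercase whitespace-separated words. This is
    an illustrative choice, not the official EVJVQA tokenization.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_f1(predictions, references):
    """Average token-level F1 over paired predictions and references."""
    scores = [token_f1(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores) if scores else 0.0
```

Under this definition, a prediction that omits one word of a three-word reference is still credited for the two words it shares, so partially correct answers receive partial credit rather than zero.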