In recent years, visual question answering (VQA) has attracted attention from the research community because of its promising applications (such as virtual assistants in intelligent cars, assistive devices for blind people, or information retrieval from document images using natural-language queries) and the challenges it poses. The VQA task requires methods that can fuse information from questions and images to produce appropriate answers. Neural VQA models have made tremendous progress on large-scale datasets, most of which are for resource-rich languages such as English. However, available datasets narrow the VQA task to answer selection or answer classification. We argue that this form of VQA is far from human ability and, by merely selecting answers rather than generating them, removes the challenge of the answering aspect of the task. In this paper, we introduce OpenViVQA (Open-domain Vietnamese Visual Question Answering), the first large-scale dataset for VQA with open-ended answers in Vietnamese, consisting of 11,000+ images associated with 37,000+ question-answer pairs (QAs). Moreover, we propose FST, QuMLAG, and MLPAG, which fuse information from images and answers and then use the fused features to construct answers iteratively, as humans do. Our proposed methods achieve results competitive with SOTA models such as SAAA, MCAN, LORA, and M4C. The dataset is available to encourage the research community to develop more generalized algorithms, including transformers, for low-resource languages such as Vietnamese.