Visual Question Answering (VQA) has been primarily studied through the lens of the English language. Yet, tackling VQA in other languages in the same manner would require a considerable amount of resources. In this paper, we propose scalable solutions to multilingual visual question answering (mVQA), on both data and modeling fronts. We first propose a translation-based framework for mVQA data generation that requires much less human annotation effort than the conventional approach of directly collecting questions and answers. Then, we apply our framework to the multilingual captions in the Crossmodal-3600 dataset and develop an efficient annotation protocol to create MaXM, a test-only VQA benchmark in 7 diverse languages. Finally, we propose an approach to unified, extensible, open-ended, and end-to-end mVQA modeling and demonstrate strong performance in 13 languages.