Visual question answering (VQA) is one of the crucial vision-and-language tasks. Yet, existing VQA research has mostly focused on the English language, due to a lack of suitable evaluation resources. Previous work on cross-lingual VQA has reported poor zero-shot transfer performance of current multilingual multimodal Transformers, with large gaps to monolingual performance, but without any deeper analysis. In this work, we delve deeper into the different aspects of cross-lingual VQA, aiming to understand the impact of 1) modeling methods and choices, including architecture, inductive bias, and fine-tuning; and 2) learning biases, including question types and modality biases in cross-lingual setups. The key results of our analysis are: 1) We show that simple modifications to the standard training setup can substantially reduce the transfer gap to monolingual English performance, improving over existing methods by 10 accuracy points. 2) We analyze cross-lingual VQA across question types of varying complexity for different multilingual multimodal Transformers, and identify the question types that are the most difficult to improve on. 3) We provide an analysis of the modality biases present in training data and models, revealing why zero-shot performance gaps remain for certain question types and languages.