Vision-and-language (V&L) models pretrained on large-scale multimodal data have demonstrated strong performance on tasks such as image captioning and visual question answering (VQA). The quality of such models is commonly assessed by measuring their performance on unseen data that typically comes from the same distribution as the training data. However, when evaluated under out-of-distribution (OOD; here, out-of-dataset) settings for VQA, we observe that these models generalize poorly. We comprehensively evaluate two pretrained V&L models under different settings (i.e., classification and open-ended text generation) by conducting cross-dataset evaluations. We find that these models tend to learn to solve the benchmark rather than the high-level skills required by the VQA task. We also find that in most cases generative models are less susceptible to shifts in data distribution than discriminative ones, and that multimodal pretraining is generally helpful for OOD generalization. Finally, we revisit assumptions underlying the use of automatic VQA evaluation metrics, and empirically show that their stringent nature repeatedly penalizes models for correct responses.
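To make the last point concrete, the sketch below shows how a string-matching metric can give zero credit to a reasonable answer. The abstract does not name the metric under study; this example assumes the widely used simplified VQA accuracy, min(number of annotators who gave the answer / 3, 1). The function names, the light normalization, and the annotator answers are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def normalize(answer: str) -> str:
    # Assumed minimal normalization: lowercase and collapse whitespace.
    # Real evaluation scripts typically do more (punctuation, articles, ...).
    return " ".join(answer.lower().strip().split())


def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    # Simplified VQA accuracy: an answer is fully correct once at least
    # three annotators gave exactly the same (normalized) string.
    counts = Counter(normalize(a) for a in human_answers)
    return min(counts[normalize(prediction)] / 3.0, 1.0)


# Hypothetical annotations for "What animal is in the picture?"
human_answers = ["dog"] * 8 + ["puppy"] * 2

print(vqa_accuracy("dog", human_answers))    # exact match with the majority
print(vqa_accuracy("puppy", human_answers))  # partial credit: two annotators
print(vqa_accuracy("a dog", human_answers))  # no credit under this normalization
```

Under these assumptions, "a dog" scores zero even though a human reader would accept it. Open-ended generation is especially exposed to this kind of penalty, because a generative model is not restricted to the answer strings seen in a given benchmark.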