Current visual question answering (VQA) models tend to be trained and evaluated on image-question pairs in isolation. However, the questions people ask depend on their informational needs and their prior knowledge of the image content. To evaluate how situating images within naturalistic contexts shapes visual questions, we introduce CommVQA, a VQA dataset consisting of images, image descriptions, real-world communicative scenarios where the image might appear (e.g., a travel website), and follow-up questions and answers conditioned on the scenario. We show that CommVQA poses a challenge for current models. Providing contextual information to VQA models improves performance broadly, highlighting the relevance of situating systems within a communicative scenario.