In its ideal form, Visual Question Answering (VQA) requires understanding, grounding, and reasoning in the joint space of vision and language, and serves as a proxy for the AI task of scene understanding. However, most existing VQA benchmarks only require picking the answer from a pre-defined set of options and pay little attention to text. We present a new challenge with a dataset of 23,781 questions based on 10,124 image-text pairs. Specifically, the task requires a model to align multimedia representations of the same entity, perform multi-hop reasoning between image and text, and finally answer the question in natural language. The aim of this challenge is to develop and benchmark models capable of multimedia entity alignment, multi-step reasoning, and open-ended answer generation.