In this paper, we present Toloka Visual Question Answering, a new crowdsourced dataset for comparing the performance of machine learning systems against human-level expertise in the grounding visual question answering task. In this task, given an image and a textual question, one has to draw a bounding box around the object that correctly answers the question. Every image-question pair has an answer, with only one correct answer per image. Our dataset contains 45,199 image-question pairs in English, provided with ground-truth bounding boxes and split into a train subset and two test subsets. Besides describing the dataset and releasing it under a CC BY license, we conducted a series of experiments on open-source zero-shot baseline models and organized a multi-phase competition at WSDM Cup that attracted 48 participants worldwide. However, at the time of paper submission, no machine learning model had outperformed the non-expert crowdsourcing baseline according to the intersection over union evaluation score.
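For readers unfamiliar with the evaluation score named in the abstract, the following is a minimal sketch of intersection over union (IoU) for two axis-aligned bounding boxes. The `(left, top, right, bottom)` box layout, the function names, and the averaging of per-pair scores are assumptions made for illustration. The abstract does not specify the annotation format or how scores are aggregated.

```python
from typing import Sequence, Tuple

# Assumed box layout: (left, top, right, bottom) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero area."""
    width = max(0.0, box[2] - box[0])
    height = max(0.0, box[3] - box[1])
    return width * height


def iou(pred: Box, truth: Box) -> float:
    """Intersection over union of a predicted and a ground-truth box."""
    # The overlap rectangle is bounded by the innermost edges of both boxes.
    inter_w = max(0.0, min(pred[2], truth[2]) - max(pred[0], truth[0]))
    inter_h = max(0.0, min(pred[3], truth[3]) - max(pred[1], truth[1]))
    intersection = inter_w * inter_h
    union = box_area(pred) + box_area(truth) - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Box], truths: Sequence[Box]) -> float:
    """Average IoU over image-question pairs (assumed aggregation)."""
    if len(preds) != len(truths):
        raise ValueError("preds and truths must have the same length")
    if not preds:
        return 0.0
    return sum(iou(p, t) for p, t in zip(preds, truths)) / len(preds)
```

A score of 1.0 means the predicted box coincides with the ground truth, and 0.0 means the two boxes do not overlap at all. Because each image-question pair has exactly one correct box, a single IoU value per pair is enough to compare a model's prediction with a crowd annotator's answer on the same scale.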