HL Dataset: Visually-grounded Description of Scenes, Actions and Rationales

Current captioning datasets focus on object-centric captions, which describe the visible objects in an image, e.g. "people eating food in a park". Although these datasets are useful for evaluating the ability of Vision & Language models to recognize and describe visual content, they do not support controlled experiments in which models are tested or fine-tuned on higher-level captions, which humans find easy and natural to produce. For example, people often describe images based on the type of scene they depict ("people at a holiday resort") and the actions being performed in them ("people having a picnic"). Such descriptions draw on personal experience and commonsense assumptions. We present the High-Level Dataset, a dataset extending 14,997 images from the COCO dataset, aligned with a new set of 134,973 human-annotated (high-level) captions collected along three axes: scenes, actions, and rationales. We further extend this dataset with confidence scores collected from an independent set of readers, as well as a set of narrative captions generated synthetically by combining the three axes. We describe this dataset and analyse it extensively. We also present baseline results for the High-Level Captioning task.
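As a quick sanity check on the figures quoted in the abstract, the sketch below divides the caption count by the image count. The two totals come from the abstract; the even split across the three axes is an assumption made here for illustration and is not stated in the abstract.

```python
# Figures quoted in the abstract.
NUM_IMAGES = 14_997
NUM_CAPTIONS = 134_973
AXES = ("scene", "action", "rationale")

# Captions per image, and whether the totals divide evenly.
captions_per_image, remainder = divmod(NUM_CAPTIONS, NUM_IMAGES)

# Assumption (not stated in the abstract): captions are spread
# evenly over the three axes for every image.
captions_per_axis_per_image = captions_per_image / len(AXES)

print(captions_per_image, remainder, captions_per_axis_per_image)
```

The totals divide evenly, which is consistent with a fixed number of captions per image, although the abstract itself does not spell out the per-axis breakdown.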

