The State and Fate of Summarization Datasets: A Survey

Automatic summarization has consistently attracted attention due to its versatility and wide application in various downstream tasks. Despite its popularity, we find that annotation efforts have largely been disjointed and have lacked common terminology. Consequently, it is challenging to discover existing resources or identify coherent research directions. To address this, we survey a large body of work spanning 133 datasets in over 100 languages, creating a novel ontology covering sample properties, collection methods and distribution. With this ontology we make key observations, including the lack of accessible high-quality datasets for low-resource languages, and the field's over-reliance on the news domain and on automatically collected distant supervision. Finally, we make available a web interface that allows users to interact with and explore our ontology and dataset collection, as well as a template for a summarization data card, which can be used to streamline future research into a more coherent body of work.
