Automatic summarization has consistently attracted attention due to its versatility and wide application in various downstream tasks. Despite its popularity, we find that annotation efforts have largely been disjointed and have lacked common terminology. Consequently, it is challenging to discover existing resources or identify coherent research directions. To address this, we survey a large body of work spanning 133 datasets in over 100 languages, creating a novel ontology covering sample properties, collection methods, and distribution. With this ontology we make key observations, including the lack of accessible high-quality datasets for low-resource languages, and the field's over-reliance on the news domain and on automatically collected distant supervision. Finally, we make available a web interface that allows users to interact with and explore our ontology and dataset collection, as well as a template for a summarization data card, which can be used to streamline future research into a more coherent body of work.