Automatic summarization has consistently attracted attention due to its versatility and wide application in downstream tasks. Despite this popularity, we find that annotation efforts have largely been disjointed and have lacked common terminology. Consequently, it is challenging to discover existing resources or identify coherent research directions. To address this, we survey a large body of work spanning 133 datasets in over 100 languages, creating a novel ontology covering sample properties, collection methods and distribution. With this ontology we make key observations, including the lack of accessible high-quality datasets for low-resource languages and the field's over-reliance on the news domain and on automatically collected distant supervision. Finally, we make available a web interface that allows users to interact with and explore our ontology and dataset collection, as well as a template for a summarization data card, which can be used to streamline future research into a more coherent body of work.