Extracting who says what to whom is a crucial part of analyzing human communication, particularly given today's abundance of data such as online news articles. Yet the lack of annotated data for this task in German news articles severely limits the quality and usability of possible systems. To remedy this, we present a new, freely available, Creative Commons-licensed dataset for quotation attribution in German news articles based on WIKINEWS. The dataset provides curated, high-quality annotations across 1,000 documents (250,000 tokens) in a fine-grained annotation schema that enables various downstream uses. The annotations specify not only who said what but also how, in which context, and to whom, and they define the type of quotation. We specify our annotation schema, describe the creation of the dataset, and provide a quantitative analysis. Further, we describe suitable evaluation metrics, apply two existing systems for quotation attribution, discuss their results to evaluate the utility of our dataset, and outline use cases of our dataset in downstream tasks.