Dataset of Quotation Attribution in German News Articles

Extracting who says what to whom is a crucial part of analyzing human communication in today's abundance of data, such as online news articles. Yet the lack of annotated data for this task in German news articles severely limits the quality and usability of possible systems. To remedy this, we present a new, freely available, Creative Commons-licensed dataset for quotation attribution in German news articles based on WIKINEWS. The dataset provides curated, high-quality annotations across 1000 documents (250,000 tokens) in a fine-grained annotation schema that enables various downstream uses. The annotations specify not only who said what but also how, in which context, and to whom, and they define the type of quotation. We specify our annotation schema, describe the creation of the dataset, and provide a quantitative analysis. Further, we describe suitable evaluation metrics, apply two existing systems for quotation attribution, discuss their results to evaluate the utility of our dataset, and outline use cases in downstream tasks.

