Quotation extraction is a useful task from both a sociological and a Natural Language Processing perspective. However, very little data is available to study this task in languages other than English. In this paper, we present a manually annotated corpus of 1676 French newswire texts for quotation extraction and source attribution. We first describe the composition of our corpus and the choices made in selecting the data. We then detail the annotation guidelines and annotation process, and report a few statistics about the final corpus, including the balance obtained between quote types (direct, indirect and mixed, which are particularly challenging). We end by reporting the inter-annotator agreement among the 8 annotators who worked on manual labelling, which is high for such a difficult linguistic phenomenon.