While coreference resolution is attracting more interest than ever from researchers in computational literary studies, representative datasets of fully annotated long documents remain surprisingly scarce. In this paper, we introduce a new annotated corpus of three full-length French novels, totaling over 285,000 tokens. Unlike previous datasets focused on shorter texts, our corpus addresses the challenges posed by long, complex literary works, enabling the evaluation of coreference models on long reference chains. We present a modular coreference resolution pipeline that allows for fine-grained error analysis. We show that our approach is competitive and scales effectively to long documents. Finally, we demonstrate its usefulness for inferring the gender of fictional characters, showcasing its relevance for both literary analysis and downstream NLP tasks.