Modeling discourse, the linguistic phenomena that go beyond individual sentences, is a fundamental yet challenging aspect of natural language processing (NLP). However, existing evaluation benchmarks primarily focus on intra-sentence properties and overlook critical discourse phenomena that cross sentence boundaries. To bridge this gap, we propose Disco-Bench, a benchmark that evaluates inter-sentence discourse properties across a diverse set of NLP tasks covering understanding, translation, and generation. Disco-Bench consists of 9 document-level test sets in the literature domain, which contain rich discourse phenomena (e.g., cohesion and coherence) in Chinese and/or English. For linguistic analysis, we also design a diagnostic test suite that examines whether the target models learn discourse knowledge. In total, we evaluate 20 general-purpose, in-domain, and commercial models based on Transformer, advanced pretraining architectures, and large language models (LLMs). Our results show that (1) our evaluation benchmark is both challenging and necessary; and (2) fine-grained pretraining on literary document-level training data consistently improves the modeling of discourse information. We will release the datasets, pretrained models, and leaderboard, which we hope will significantly facilitate research in this field: https://github.com/longyuewangdcu/Disco-Bench.