Recently, there has been growing interest in designing text generation systems from a discourse coherence perspective, e.g., by modeling the interdependence between sentences. Still, recent BERT-based evaluation metrics are weak at recognizing coherence and are thus unreliable for spotting the discourse-level improvements of those text generation systems. In this work, we introduce DiscoScore, a parametrized discourse metric that uses BERT to model discourse coherence from different perspectives, driven by Centering theory. Our experiments encompass 16 non-discourse and discourse metrics, including DiscoScore and popular coherence models, evaluated on summarization and document-level machine translation (MT). We find that (i) the majority of BERT-based metrics correlate much worse with human-rated coherence than early discourse metrics invented a decade ago; and (ii) the recent state-of-the-art BARTScore is weak when operated at the system level, which is particularly problematic as systems are typically compared in this manner. DiscoScore, in contrast, achieves strong system-level correlation with human ratings, not only in coherence but also in factual consistency and other aspects, and surpasses BARTScore by over 10 correlation points on average. Further, to better understand DiscoScore, we provide justifications for the importance of discourse coherence for evaluation metrics and explain the superiority of one variant over another. Our code is available at \url{https://github.com/AIPHES/DiscoScore}.