Recent large language models (LLMs) have shown remarkable performance in aligning generated text with user intentions across various tasks. For long-form text generation, there is growing interest in approaching generation from a discourse coherence perspective. However, existing lexical or semantic metrics such as BLEU, ROUGE, and BERTScore cannot effectively capture discourse coherence. The development of discourse-specific automatic evaluation methods for assessing LLM output therefore warrants greater focus and exploration. In this paper, we present a novel automatic metric designed to quantify the discourse divergence between two long-form articles. Extensive experiments on three datasets from representative domains demonstrate that our metric aligns more closely with human preferences and GPT-4 coherence evaluation than existing evaluation methods.