Text summarization has a wide range of applications, but evaluating the quality of generated text remains a complex problem. A major challenge in language evaluation is the clear divergence between existing automatic metrics and human judgment. For example, human annotators can assess the quality of a document summary along both objective dimensions, such as grammatical and semantic correctness, and subjective dimensions, such as comprehensiveness, succinctness, and interestingness. Most automatic evaluation methods, such as BLEU and ROUGE, may not capture these dimensions well. In this paper, we propose a new LLM-based evaluation framework that compares generated text with reference text along both objective and subjective dimensions. First, we model the objective and subjective dimensions of generated text with a roleplayer prompting mechanism. Second, we introduce a context-based prompting mechanism that generates dynamic roleplayer profiles from the input context. Finally, we design a multi-roleplayer prompting technique based on batch prompting that integrates the individual evaluation results into a final evaluation result. Experimental results on two real-world summarization datasets show that our model is highly competitive and agrees closely with human annotators.
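The multi-roleplayer batch prompting idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: the roleplayer profiles, the prompt wording, the "k: generated / k: reference" answer format, and the majority-vote aggregation are all assumptions made for illustration, and `llm` is a hypothetical callable that stands in for any model client.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# Hypothetical static profiles (role, dimension). The paper generates
# profiles dynamically from the input context; these are placeholders.
ROLEPLAYERS: List[Tuple[str, str]] = [
    ("Grammar checker", "objective: grammatical correctness"),
    ("Fact checker", "objective: semantic correctness"),
    ("General reader", "subjective: interestingness"),
]


def build_batch_prompt(reference: str, generated: str,
                       roleplayers: Sequence[Tuple[str, str]]) -> str:
    """Put every roleplayer into ONE prompt (batch prompting)."""
    lines = [
        "Compare the two summaries below from each role's point of view.",
        f"Reference summary: {reference}",
        f"Generated summary: {generated}",
        "Roles:",
    ]
    for i, (role, dimension) in enumerate(roleplayers, start=1):
        lines.append(f"{i}. {role} ({dimension})")
    # Assumed answer format, one line per role.
    lines.append("Answer with one line per role: '<number>: generated' "
                 "or '<number>: reference'.")
    return "\n".join(lines)


def parse_votes(response: str, n_roles: int) -> List[str]:
    """Extract up to n_roles well-formed votes; malformed lines are skipped."""
    votes: List[str] = []
    for line in response.splitlines():
        index, sep, vote = line.partition(":")
        vote = vote.strip().lower()
        if sep and index.strip().isdigit() and vote in ("generated", "reference"):
            votes.append(vote)
    return votes[:n_roles]


def aggregate(votes: Sequence[str]) -> str:
    """Majority vote over roleplayers (an assumed aggregation rule)."""
    counts = Counter(votes)
    if counts["generated"] == counts["reference"]:
        return "tie"
    return "generated" if counts["generated"] > counts["reference"] else "reference"


def evaluate(reference: str, generated: str, llm: Callable[[str], str],
             roleplayers: Sequence[Tuple[str, str]] = ROLEPLAYERS) -> str:
    """Run one batched call and merge the per-role votes into one verdict."""
    prompt = build_batch_prompt(reference, generated, roleplayers)
    return aggregate(parse_votes(llm(prompt), len(roleplayers)))
```

Passing the model as a plain callable keeps the sketch independent of any particular API; in practice `llm` would wrap a real client and the parsing step would need to tolerate less regular output.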