Recently, natural language generation (NLG) evaluation has shifted from a single-aspect to a multi-aspect paradigm, allowing for more accurate assessment. Large language models (LLMs) achieve superior performance on various NLG evaluation tasks. However, current work often employs the LLM to evaluate different aspects independently, which largely ignores the rich correlation between aspects. To fill this research gap, we propose an NLG evaluation metric called CoAScore. Powered by LLMs, CoAScore utilizes multi-aspect knowledge through a CoA (\textbf{C}hain-\textbf{o}f-\textbf{A}spects) prompting framework when assessing the quality of a given aspect. Specifically, for a given target aspect, we first prompt the LLM to generate a chain of aspects that are relevant to the target aspect and could be useful for its evaluation. We then collect an evaluation score for each generated aspect and, finally, leverage the knowledge of these aspects to improve the evaluation of the target aspect. We evaluate CoAScore across five NLG evaluation tasks (e.g., summarization and dialog response generation) and nine aspects (e.g., overall quality, relevance, and coherence). Our experiments show that, compared with evaluating each aspect individually, CoAScore correlates more strongly with human judgments. With this improvement, CoAScore significantly outperforms existing unsupervised evaluation metrics, both for overall quality and for other aspects. We also conduct extensive ablation studies to validate the effectiveness of the three stages of the CoAScore framework, and case studies to show how the LLM performs in each stage. Our code and scripts are available.
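The three stages described above (generate a chain of relevant aspects, score each one, then score the target aspect with those scores as context) can be illustrated with a minimal sketch. This is not the released CoAScore code: the `llm` callable, the prompt wording, the 1-to-5 scale, the chain length `k`, and all function names are assumptions made for illustration only.

```python
from typing import Callable, Dict, List

# Hypothetical interface: any function that maps a prompt string to a reply string.
LLM = Callable[[str], str]


def generate_aspect_chain(llm: LLM, target: str, k: int = 3) -> List[str]:
    """Stage 1: ask the LLM for aspects relevant to the target aspect."""
    prompt = (
        f"List up to {k} aspects, one per line, that are relevant to "
        f"evaluating the '{target}' of a generated text."
    )
    # Strip bullet markers and whitespace, drop empty lines, keep at most k.
    lines = [ln.strip(" -\t") for ln in llm(prompt).splitlines()]
    return [ln for ln in lines if ln][:k]


def parse_score(reply: str, default: float = 0.0) -> float:
    """Return the first number found in the reply, or `default` if none."""
    for tok in reply.replace("/", " ").split():
        try:
            return float(tok)
        except ValueError:
            continue
    return default


def score_aspect(llm: LLM, text: str, aspect: str, context: str = "") -> float:
    """Score one aspect of `text`; `context` may carry other aspects' scores."""
    prompt = (
        f"{context}Rate the {aspect} of the following text from 1 to 5.\n"
        f"Text: {text}\nScore:"
    )
    return parse_score(llm(prompt))


def coa_score(llm: LLM, text: str, target: str, k: int = 3) -> float:
    """Run the three stages and return the score for the target aspect."""
    chain = generate_aspect_chain(llm, target, k)                 # Stage 1
    scores: Dict[str, float] = {
        aspect: score_aspect(llm, text, aspect) for aspect in chain
    }                                                             # Stage 2
    context = "".join(
        f"The {aspect} of the text was rated {score:g} out of 5.\n"
        for aspect, score in scores.items()
    )
    return score_aspect(llm, text, target, context)              # Stage 3
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM API; in practice `llm` would wrap whichever model client is in use, and the independent-evaluation baseline corresponds to calling `score_aspect` on the target with an empty `context`.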