CoAScore: Chain-of-Aspects Prompting for NLG Evaluation

Recently, natural language generation (NLG) evaluation has shifted from a single-aspect to a multi-aspect paradigm, allowing for a more accurate assessment. Large language models (LLMs) achieve superior performance on various NLG evaluation tasks. However, current work often employs the LLM to independently evaluate different aspects, which largely ignores the rich correlation between various aspects. To fill this research gap, in this work, we propose an NLG evaluation metric called CoAScore. Powered by LLMs, the CoAScore utilizes multi-aspect knowledge through a CoA (\textbf{C}hain-\textbf{o}f-\textbf{A}spects) prompting framework when assessing the quality of a certain aspect. Specifically, for a given aspect to evaluate, we first prompt the LLM to generate a chain of aspects that are relevant to the target aspect and could be useful for the evaluation. We then collect evaluation scores for each generated aspect, and finally, leverage the knowledge of these aspects to improve the evaluation of the target aspect. We evaluate CoAScore across five NLG evaluation tasks (e.g., summarization, dialog response generation, etc) and nine aspects (e.g., overall quality, relevance, coherence, etc). Our experimental findings highlight that, in comparison to individual aspect evaluation, CoAScore exhibits a higher correlation with human judgments. This improvement significantly outperforms existing unsupervised evaluation metrics, whether for assessing overall quality or other aspects. We also conducted extensive ablation studies to validate the effectiveness of the three stages within the CoAScore framework and conducted case studies to show how the LLM performs in these stages. Our code and scripts are available.

翻译：近期，自然语言生成（NLG）评估已从单方面范式转向多方面范式，从而能够进行更准确的评估。大语言模型（LLM）在各类NLG评估任务中展现出卓越性能。然而，现有工作通常采用LLM独立评估不同方面，这在很大程度上忽略了各方面之间的丰富关联性。为填补这一研究空白，我们提出名为CoAScore的NLG评估指标。该指标借助LLM，在评估特定方面质量时，通过CoA（链式方面）提示框架利用多方面的知识。具体而言，对于待评估的给定方面，我们首先提示LLM生成与目标方面相关且有助于评估的方面链；随后收集每个生成方面的评估分数；最终利用这些方面的知识来改进目标方面的评估。我们跨越五项NLG评估任务（如摘要生成、对话响应生成等）及九个评估方面（如整体质量、相关性、连贯性等）对CoAScore进行了评估。实验结果表明，与独立方面评估相比，CoAScore与人类判断的一致性更高。这一改进显著超越了现有无监督评估指标，无论是在整体质量评估还是其他方面评估中。我们还进行了广泛的消融研究以验证CoAScore框架中三个阶段的效能，并通过案例研究展示了LLM在这些阶段中的表现。我们的代码与脚本已公开提供。

相关内容

大语言模型

关注 67

大语言模型是基于海量文本数据训练的深度学习模型。它不仅能够生成自然语言文本，还能够深入理解文本含义，处理各种自然语言任务，如文本摘要、问答、翻译等。2023年，大语言模型及其在人工智能领域的应用已成为全球科技研究的热点，其在规模上的增长尤为引人注目，参数量已从最初的十几亿跃升到如今的一万亿。参数量的提升使得模型能够更加精细地捕捉人类语言微妙之处，更加深入地理解人类语言的复杂性。在过去的一年里，大语言模型在吸纳新知识、分解复杂任务以及图文对齐等多方面都有显著提升。随着技术的不断成熟，它将不断拓展其应用范围，为人类提供更加智能化和个性化的服务，进一步改善人们的生活和生产方式。

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

专知会员服务

46+阅读 · 2021年11月24日

【亚马逊-WWW2020】不解析,生成!用于面向任务的语义分析的序列到序列体系结构，Don't Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing

专知会员服务

15+阅读 · 2020年2月1日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

35+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日