Is ChatGPT a Good NLG Evaluator? A Preliminary Study

The recent emergence of ChatGPT has attracted wide attention from the computational linguistics community. Many prior studies have shown that ChatGPT achieves remarkable performance on various NLP tasks as measured by automatic evaluation metrics. However, the ability of ChatGPT itself to serve as an evaluation metric remains underexplored. Because assessing the quality of NLG models is an arduous task, and previous statistical metrics are known to correlate poorly with human judgments, we ask whether ChatGPT is a good NLG evaluation metric. In this report, we provide a preliminary meta-evaluation of ChatGPT to show its reliability as an NLG metric. Specifically, we treat ChatGPT as a human evaluator and give it task-specific (e.g., summarization) and aspect-specific (e.g., relevance) instructions that prompt it to score the outputs of NLG models. We conduct experiments on three widely used NLG meta-evaluation datasets (covering summarization, story generation, and data-to-text tasks). Experimental results show that, compared with previous automatic metrics, ChatGPT achieves state-of-the-art or competitive correlation with gold-standard human judgments. We hope our preliminary study will encourage the emergence of a general-purpose, reliable NLG metric.
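The meta-evaluation setup described in the abstract can be sketched roughly as follows: build a task- and aspect-specific scoring prompt, collect one score per generated output, and measure how well those scores rank-correlate with human ratings. This is a minimal sketch under stated assumptions, not the authors' implementation. The prompt wording, the 1 to 5 scale, and the function names `build_prompt`, `average_ranks`, `pearson`, and `spearman` are all hypothetical, and Spearman correlation is used here only as one common choice of correlation measure. No model is called; the scores in the usage example are toy values, not results from the paper.

```python
from statistics import mean


def build_prompt(task, aspect, source, output):
    """Build a scoring prompt (hypothetical template, not the paper's wording)."""
    return (
        f"Score the following {task} with respect to {aspect} "
        "on a scale from 1 to 5, where 1 is the lowest and 5 is the highest.\n\n"
        f"Source:\n{source}\n\n"
        f"Output:\n{output}\n\n"
        "Score:"
    )


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    prompt = build_prompt("summary", "relevance", "<source article>", "<model summary>")
    print(prompt)

    # Toy values for illustration only (not taken from the paper).
    human_scores = [1, 2, 3, 4, 5]
    metric_scores = [2, 1, 3, 5, 4]
    print(spearman(human_scores, metric_scores))
```

In a real setup, each metric score would be parsed from the model's reply to the prompt, and a higher correlation with the human scores would indicate a more reliable metric.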


