Recently, the emergence of ChatGPT has attracted wide attention from the computational linguistics community. Many prior studies have shown that ChatGPT achieves remarkable performance on various NLP tasks when measured by automatic evaluation metrics. However, the ability of ChatGPT to serve as an evaluation metric itself remains underexplored. Because assessing the quality of NLG models is an arduous task and traditional statistical metrics are known to correlate poorly with human judgments, we ask whether ChatGPT is a good NLG evaluation metric. In this report, we provide a preliminary meta-evaluation of ChatGPT to assess its reliability as an NLG metric. Specifically, we treat ChatGPT as a human evaluator and give it task-specific (e.g., summarization) and aspect-specific (e.g., relevance) instructions that prompt it to score the outputs of NLG models. We conduct experiments on three widely used NLG meta-evaluation datasets, covering summarization, story generation, and data-to-text tasks. Experimental results show that, compared with previous automatic metrics, ChatGPT achieves state-of-the-art or competitive correlation with gold-standard human judgments. We hope our preliminary study will encourage the development of a general-purpose, reliable NLG metric.
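The meta-evaluation loop described above (prompt the model for a score, then correlate those scores with human ratings) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the procedure used in the report: the prompt wording, the 1 to 5 scale, the helper names `build_prompt`, `parse_score`, `average_ranks`, `pearson` and `spearman`, and the choice of Spearman correlation are all hypothetical. The call to the model itself is deliberately left out.

```python
import re
from statistics import mean


def build_prompt(task, aspect, source, output):
    """Build a task- and aspect-specific scoring instruction.

    The wording and the 1-5 scale are assumptions made for illustration.
    """
    return (
        f"Score the following {task} output with respect to {aspect} "
        f"on a scale from 1 to 5, where 1 is the lowest and 5 the highest.\n\n"
        f"Source:\n{source}\n\nOutput:\n{output}\n\nScore:"
    )


def parse_score(reply):
    """Return the first number found in the model's reply, or None."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    return float(match.group()) if match else None


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

In use, one would send `build_prompt("summarization", "relevance", article, summary)` to the model for each sample, pass each reply through `parse_score`, and then call `spearman(model_scores, human_scores)` on the two resulting lists. A metric whose scores rank the samples in the same order as the human annotators yields a correlation close to 1.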