Human evaluation is indispensable and unavoidable for assessing the quality of texts generated by machine learning models or written by humans. However, human evaluation is difficult to reproduce and its quality is notoriously unstable, which hinders fair comparisons among different natural language processing (NLP) models and algorithms. Recently, large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided. In this paper, we explore whether this ability of LLMs can serve as an alternative to human evaluation. We present the LLMs with the exact same instructions, samples to be evaluated, and questions used to conduct human evaluation, and then ask the LLMs to generate responses to those questions; we dub this LLM evaluation. We apply both human evaluation and LLM evaluation to texts from two NLP tasks: open-ended story generation and adversarial attacks. We show that the results of LLM evaluation are consistent with those obtained by expert human evaluation: the texts rated higher by human experts are also rated higher by the LLMs. We also find that the results of LLM evaluation are stable across different formattings of the task instructions and across the sampling algorithms used to generate the answers. We are the first to show the potential of using LLMs to assess the quality of texts, and we discuss the limitations and ethical considerations of LLM evaluation.
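The LLM evaluation procedure described in the abstract (give the model the same instructions, sample, and question a human rater would see, then read a rating out of its response) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `build_prompt`, `parse_rating`, and `llm_evaluate` are hypothetical, the `generate` callable stands in for whichever LLM is queried, and the 1 to 5 rating scale is an assumption, since the abstract does not specify one.

```python
import re
from statistics import mean
from typing import Callable, Iterable, Optional


def build_prompt(instructions: str, sample: str, question: str) -> str:
    """Join the three pieces a human rater would also be shown.

    The layout (blank-line separators, "Text:" label) is an assumption.
    """
    return f"{instructions}\n\nText:\n{sample}\n\n{question}\n"


def parse_rating(response: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Return the first standalone integer within [low, high], else None.

    The 1-5 default scale is assumed for illustration only.
    """
    for match in re.finditer(r"\b\d+\b", response):
        value = int(match.group())
        if low <= value <= high:
            return value
    return None


def llm_evaluate(
    samples: Iterable[str],
    instructions: str,
    question: str,
    generate: Callable[[str], str],
) -> float:
    """Average the ratings an LLM gives to a set of samples.

    `generate` is a hypothetical callable mapping a prompt to the model's
    free-text response; responses with no parsable rating are skipped.
    """
    ratings = []
    for sample in samples:
        response = generate(build_prompt(instructions, sample, question))
        rating = parse_rating(response)
        if rating is not None:
            ratings.append(rating)
    if not ratings:
        raise ValueError("no parsable ratings in the model responses")
    return mean(ratings)
```

Comparing the averages returned by `llm_evaluate` for two groups of texts (for example, human-written versus model-generated stories) against the corresponding averages from human raters is one way to check whether both evaluations rank the groups in the same order.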