Evaluating natural language generation (NLG) is a vital but challenging problem in artificial intelligence. Traditional evaluation metrics, which mainly capture content overlap (e.g., n-gram overlap) between system outputs and references, are far from satisfactory, while large language models (LLMs) such as ChatGPT have demonstrated great potential for NLG evaluation in recent years. Various LLM-based automatic evaluation methods have been proposed, including metrics derived from LLMs, prompting LLMs, and fine-tuning LLMs with labeled evaluation data. In this survey, we first present a taxonomy of LLM-based NLG evaluation methods and discuss the pros and cons of each. We also discuss human-LLM collaboration for NLG evaluation. Finally, we examine several open problems in this area and point out future research directions.