In the rapidly evolving domain of Natural Language Generation (NLG) evaluation, the introduction of Large Language Models (LLMs) has opened new avenues for assessing the quality of generated content, such as coherence, creativity, and context relevance. This paper aims to provide a thorough overview of leveraging LLMs for NLG evaluation, a burgeoning area that lacks a systematic analysis. We propose a coherent taxonomy for organizing existing LLM-based evaluation metrics, offering a structured framework to understand and compare these methods. We critically assess various LLM-based evaluation methodologies and compare their strengths and limitations in evaluating NLG outputs. By discussing unresolved challenges, including bias, robustness, domain-specificity, and unified evaluation, this paper seeks to offer insights to researchers and to advocate for fairer and more advanced NLG evaluation techniques.