Despite advances in Natural Language Generation (NLG), evaluation remains challenging. Although various new metrics and LLM-as-a-judge (LaaJ) methods have been proposed, human judgment persists as the gold standard. To systematically review how NLG evaluation has evolved, we employ an automatic information extraction scheme that gathers key information from NLG papers, focusing on the evaluation methods they use (automatic metrics, LaaJ, and human evaluation). Using metadata extracted from 14,171 papers published at four major conferences (ACL, EMNLP, NAACL, and INLG) over the past six years, we reveal several critical findings: (1) Task Divergence: While Dialogue Generation shows a rapid shift toward LaaJ (>40% in 2025), Machine Translation remains locked into n-gram metrics, and Question Answering exhibits a substantial decline in the proportion of studies conducting human evaluation. (2) Metric Inertia: Despite the development of semantic metrics, general-purpose metrics (e.g., BLEU, ROUGE) continue to be widely used across tasks without empirical justification, even though they often lack the discriminative power to distinguish between specific quality criteria. (3) Human-LaaJ Divergence: Our association analysis challenges the assumption that LLMs act as mere proxies for humans; LaaJ and human evaluations prioritize very different signals, and explicit validation is scarce (<8% of papers compare the two), with only moderate to low correlation. Based on these observations, we derive practical recommendations to improve the rigor of future NLG evaluation.