The majority of automatic metrics for evaluating NLG systems are reference-based. However, because human annotations are difficult to collect, reliable references are lacking in many application scenarios. Despite recent advances in reference-free metrics, it is not well understood when and where they can serve as an alternative to reference-based metrics. In this study, using diverse analytical approaches, we comprehensively assess the performance of both types of metrics across a wide range of NLG tasks, encompassing eight datasets and eight evaluation models. Our experiments show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality. However, their effectiveness varies across tasks and is influenced by the quality of candidate texts. Therefore, it is important to assess the performance of reference-free metrics before applying them to a new task, especially when inputs are in an uncommon form or when the answer space is highly variable. Our study provides insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.