Automatic evaluation of generated textual content presents an ongoing challenge within the field of NLP. Given the impressive capabilities of modern language models (LMs) across diverse NLP tasks, there is a growing trend to employ these models in creating innovative evaluation metrics for automated assessment of generation tasks. This paper investigates a pivotal question: Do language model-driven evaluation metrics inherently exhibit bias favoring texts generated by the same underlying language model? Specifically, we assess whether prominent LM-based evaluation metrics (namely BARTScore, T5Score, and GPTScore) demonstrate a favorable bias toward their respective underlying LMs in the context of summarization tasks. Our findings unveil a latent bias, particularly pronounced when such evaluation metrics are used in a reference-free manner, that is, without leveraging gold summaries. These results underscore that assessments provided by generative evaluation models can be influenced by factors beyond the inherent text quality, highlighting the necessity of developing more dependable evaluation protocols in the future.