Automatic evaluation of generated text remains an open challenge in NLP. Given the strong performance of modern language models (LMs) across diverse NLP tasks, there is a growing trend to use these models to build new metrics for automatically evaluating generation tasks. This paper investigates a central question: do LM-driven evaluation metrics inherently favor texts generated by the same underlying language model? Specifically, we assess whether prominent LM-based evaluation metrics (e.g., BARTScore, T5Score, and GPTScore) exhibit a favorable bias toward their respective underlying LMs in the context of summarization tasks. Our findings reveal a latent bias, which is particularly pronounced when such evaluation metrics are used in a reference-free manner, that is, without leveraging gold summaries. These results show that assessments provided by generative evaluation models can be influenced by factors beyond inherent text quality, highlighting the need to develop more reliable evaluation protocols in the future.