In this work, we explore a useful but often neglected methodology for analyzing the robustness of text generation evaluation metrics: stress tests with synthetic data. Specifically, we design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in metric scores. We examine a range of recently proposed evaluation metrics based on pretrained language models for the tasks of open-ended generation, translation, and summarization. Our experiments reveal interesting insensitivities, biases, and even loopholes in existing metrics. For example, we find that BERTScore is confused by truncation errors in summarization, and that MAUVE (built on top of GPT-2) is insensitive to errors at the beginning or middle of generations. Further, we investigate the reasons behind these blind spots and suggest practical workarounds for more reliable evaluation of text generation. We have released our code and data at https://github.com/cloudygoose/blindspot_nlg.
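The stress-test idea described above (inject a synthetic error, then check whether the metric score drops) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `truncate`, `unigram_f1`, and `stress_test` are hypothetical helper names, and the unigram F1 overlap is a toy stand-in metric, not BERTScore, MAUVE, or any other metric examined in this work, and not the code from the released repository.

```python
from collections import Counter


def truncate(text: str, keep_ratio: float = 0.5) -> str:
    """Synthetic truncation error: keep only the leading fraction of tokens."""
    tokens = text.split()
    keep = max(1, int(len(tokens) * keep_ratio))
    return " ".join(tokens[:keep])


def unigram_f1(candidate: str, reference: str) -> float:
    """Toy stand-in metric (unigram F1 overlap); an assumption for illustration."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())  # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def stress_test(metric, perturb, hypotheses, references):
    """Return the mean metric score before and after applying the perturbation."""
    clean = [metric(h, r) for h, r in zip(hypotheses, references)]
    noisy = [metric(perturb(h), r) for h, r in zip(hypotheses, references)]
    n = len(clean)
    return sum(clean) / n, sum(noisy) / n


references = ["the cat sat on the mat near the door"]
hypotheses = ["the cat sat on the mat near the door"]
clean_score, noisy_score = stress_test(unigram_f1, truncate, hypotheses, references)
print(f"clean={clean_score:.3f} noisy={noisy_score:.3f}")
```

A metric passes this particular test if the perturbed score is clearly lower than the clean one; a blind spot would show up as a perturbed score that stays roughly the same or even rises. In practice, the toy metric would be swapped for the metric under study and the perturbation for the error type of interest.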