The automatic evaluation of natural language generation (NLG) systems is a long-standing challenge. Recent studies have proposed various neural metrics that align well with human evaluations. Yet the robustness of these evaluators against adversarial perturbations remains largely under-explored, owing to the unique challenges of obtaining adversarial data for different NLG evaluation tasks. To address this gap, we introduce AdvEval, a novel black-box adversarial framework against NLG evaluators. AdvEval is specifically tailored to generate data that yield strong disagreements between human and victim evaluators. Inspired by the recent success of large language models (LLMs) in text generation and evaluation, we adopt strong LLMs as both the data generator and the gold evaluator. Adversarial data are automatically optimized with feedback from the gold and victim evaluators. We conduct experiments on 12 victim evaluators and 11 NLG datasets, spanning tasks including dialogue, summarization, and question evaluation. The results show that AdvEval significantly degrades the performance of various victim metrics, thereby demonstrating its efficacy.
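The feedback-driven optimization described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a simple greedy search in place of whatever optimization AdvEval actually uses, and every name here (`adversarial_search`, `propose`, `gold`, `victim`) is hypothetical. Toy functions stand in for the LLM generator, the LLM gold evaluator, and the victim metric so that the example runs on its own.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str], float]        # text -> quality score in [0, 1]
Proposer = Callable[[str], List[str]]  # text -> candidate rewrites


def disagreement(text: str, gold: Scorer, victim: Scorer) -> float:
    """Signed gap: positive when the victim rates the text above the gold score."""
    return victim(text) - gold(text)


def adversarial_search(seed: str, propose: Proposer, gold: Scorer,
                       victim: Scorer, steps: int = 5) -> Tuple[str, float]:
    """Greedy black-box search (an assumed simplification): keep the candidate
    that most increases the victim-vs-gold gap, using only evaluator scores."""
    best, best_gap = seed, disagreement(seed, gold, victim)
    for _ in range(steps):
        candidates = propose(best)
        if not candidates:
            break
        gap, cand = max(((disagreement(c, gold, victim), c) for c in candidates),
                        key=lambda pair: pair[0])
        if gap <= best_gap:  # no candidate improves the gap: stop
            break
        best, best_gap = cand, gap
    return best, best_gap


# --- Toy stand-ins (assumptions, not real evaluators) ---
REFERENCE = "the cat sat on the mat"


def toy_victim(text: str) -> float:
    """Overlap-style metric: fraction of reference words present in the text."""
    ref, hyp = set(REFERENCE.split()), set(text.split())
    return len(ref & hyp) / len(ref)


def toy_gold(text: str) -> float:
    """Stand-in for a human or LLM judge: a negation flips the meaning."""
    return 0.0 if "not" in text.split() else 1.0


def toy_propose(text: str) -> List[str]:
    """Stand-in for an LLM generator: insert 'not' at each word position."""
    words = text.split()
    return [" ".join(words[:i] + ["not"] + words[i:]) for i in range(len(words) + 1)]


best, gap = adversarial_search(REFERENCE, toy_propose, toy_gold, toy_victim)
print(best, gap)
```

In this toy setting the search settles on a negated sentence that the overlap metric still scores highly while the stand-in gold evaluator rejects it, which is the kind of disagreement the framework targets. Flipping the sign in `disagreement` would instead search for text that the gold evaluator rates highly but the victim rates poorly.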