Traditional evaluation metrics such as BLEU and ROUGE fall short of capturing the nuanced qualities of generated text, particularly when no single ground truth exists. In this paper, we explore the potential of Large Language Models (LLMs), specifically Google Gemini 1, to serve as automatic evaluators for non-standardized metrics in summarization and dialog-based tasks. We conduct experiments across multiple prompting strategies to examine how LLMs fare as quality evaluators when compared with human judgments on the SummEval and USR datasets, asking the model to generate both a score and a justification for that score. Furthermore, we probe the robustness of the LLM evaluator using perturbed inputs. Our findings suggest that while LLMs show promise, their alignment with human evaluators is limited, they are not robust against perturbations, and significant improvements are required before they can be used as standalone, reliable evaluators for subjective metrics.