Evaluating Natural Language Generation (NLG) outputs is crucial but laborious and expensive. While various automatic NLG assessment methods have been proposed, they are often task-specific and have to be engineered with a particular domain and attribute in mind. In this work, we propose a robust zero-shot approach to NLG evaluation using pairwise comparative judgment with open-source Large Language Models (LLMs). The motivation for this approach is that, even for humans, it is easier to determine which of two options is better than it is to score each option independently and objectively. We use this insight and leverage the emergent abilities of LLMs by probing FlanT5 to determine which of two candidate responses is better, rather than to assign absolute scores. Our results demonstrate that comparative assessment is more effective than absolute scoring, enabling smaller open-source LLMs to achieve performance comparable to that of larger models served through public-access APIs. We evaluate systems on both summary evaluation and dialogue response generation, and show that open-source LLMs can achieve good correlations with human scores for a range of different attributes.
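To make the idea of comparative assessment concrete, the following is a minimal illustrative sketch, not the implementation used in this work. It assumes that pairwise judgments are aggregated into a per-candidate win ratio, and that every ordered pair is compared so each pair is seen in both positions. The names `win_ratios`, `prefers_first`, and `toy_judge` are hypothetical; `toy_judge` is a placeholder that simply prefers the longer string and stands in for a query to an LLM such as FlanT5.

```python
from itertools import permutations
from typing import Callable, List


def win_ratios(
    candidates: List[str],
    prefers_first: Callable[[str, str], bool],
) -> List[float]:
    """Score each candidate by the fraction of pairwise comparisons it wins.

    `prefers_first(a, b)` is a hypothetical judge returning True when the
    response shown first (a) is judged better than the one shown second (b).
    """
    n = len(candidates)
    if n < 2:
        raise ValueError("need at least two candidates to compare")
    wins = [0] * n
    # Assumption: compare every ordered pair, so each pair is judged in
    # both presentation orders.
    for i, j in permutations(range(n), 2):
        if prefers_first(candidates[i], candidates[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # Each candidate takes part in 2 * (n - 1) comparisons.
    return [w / (2 * (n - 1)) for w in wins]


def toy_judge(a: str, b: str) -> bool:
    """Placeholder judge (NOT an LLM): prefers the longer response."""
    return len(a) > len(b)


candidates = ["short", "a medium reply", "a considerably longer reply"]
scores = win_ratios(candidates, toy_judge)
# Rank candidate indices from highest to lowest win ratio.
ranking = sorted(range(len(candidates)), key=lambda k: scores[k], reverse=True)
```

In practice, `toy_judge` would be replaced by a function that prompts the model with both candidates and reads off which one it prefers; the resulting win ratios can then be correlated with human scores.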