As natural language generation (NLG) models have become prevalent, systematically assessing the quality of machine-generated texts has become increasingly important. Recent studies introduce LLM-based evaluators that operate as reference-free metrics, demonstrating their capability to handle novel tasks adeptly. However, these models generally rely on a single-agent approach, which, we argue, imposes an inherent limit on their performance. This is because LLM agents' responses contain biases, including preferences for certain text structures or content. In this work, we propose DEBATE, an NLG evaluation framework based on a multi-agent scoring system augmented with the concept of a Devil's Advocate. Within the framework, one agent is instructed to criticize the other agents' arguments, potentially mitigating the bias in LLM agents' answers. DEBATE substantially outperforms previous state-of-the-art methods on two meta-evaluation benchmarks for NLG evaluation, SummEval and TopicalChat. We also show that the extensiveness of debates among agents and the persona of an agent can influence the performance of evaluators.
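To make the scoring procedure concrete, the following is a minimal sketch of a Devil's-Advocate-style multi-agent evaluation loop. It is an illustration under stated assumptions, not the authors' implementation: `ask_llm` is a placeholder for any chat-completion call, and names such as `debate_score`, `n_scorers`, and `n_rounds` are hypothetical.

```python
import re
from typing import Callable, List


def debate_score(
    source: str,
    candidate: str,
    ask_llm: Callable[[str], str],  # wraps an arbitrary LLM chat call, returns text
    n_scorers: int = 2,
    n_rounds: int = 2,
) -> float:
    """Return an averaged 1-5 quality score after Devil's Advocate critique rounds."""
    # 1) Independent scorer agents each propose a score with a short rationale.
    arguments: List[str] = [
        ask_llm(
            f"You are evaluator #{i + 1}. Rate the summary (1-5) of the source "
            f"below and justify briefly.\nSource: {source}\nSummary: {candidate}"
        )
        for i in range(n_scorers)
    ]

    for _ in range(n_rounds):
        # 2) A Devil's Advocate agent criticizes the scorers' arguments.
        critique = ask_llm(
            "You are a Devil's Advocate. Point out flaws or biases in these "
            "evaluations:\n" + "\n---\n".join(arguments)
        )
        # 3) Scorers revise their evaluations in light of the critique.
        arguments = [
            ask_llm(
                f"Revise your evaluation given this critique:\n{critique}\n"
                f"Your previous evaluation:\n{arg}"
            )
            for arg in arguments
        ]

    # 4) Naively extract the final numeric scores and average them.
    scores = [float(m.group()) for arg in arguments
              if (m := re.search(r"[1-5](?:\.\d+)?", arg))]
    return sum(scores) / len(scores) if scores else 0.0
```

In this sketch, the debate depth (`n_rounds`) and the critic's persona are exposed as knobs, mirroring the abstract's observation that the extensiveness of debates and the agent persona can influence evaluator performance.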