Text evaluation has historically posed significant challenges, often demanding substantial labor and time. With the emergence of large language models (LLMs), researchers have explored their potential as alternatives to human evaluation. While these single-agent approaches show promise, experimental results suggest that further advances are needed to close the gap between their current effectiveness and human-level evaluation quality. Recognizing that best practices in human evaluation often involve multiple annotators collaborating on the assessment, we adopt a multi-agent debate framework that moves beyond single-agent prompting strategies. The multi-agent approach lets a group of LLMs work alongside an array of intelligent counterparts, harnessing their distinct capabilities and expertise to improve efficiency and effectiveness on intricate tasks. In this paper, we construct a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of responses generated by different models on open-ended questions and traditional natural language generation (NLG) tasks. Our analysis shows that ChatEval goes beyond mere textual scoring, offering a human-mimicking evaluation process for reliable assessments. Our code is available at https://github.com/chanchimin/ChatEval.