The emergence of Large Language Models (LLMs) as chat assistants capable of generating human-like conversations has amplified the need for robust evaluation methods, particularly for open-ended tasks. Conventional metrics such as BLEU and ROUGE, while useful, are increasingly inadequate for capturing the subtle semantics and contextual richness of such generative outputs. We propose a reference-guided verdict method that automates evaluation by leveraging multiple LLMs as judges. Through experiments on three open-ended question-answering tasks, we demonstrate that combining multiple LLM judges significantly improves the reliability and accuracy of evaluations, particularly on complex tasks where a single model may struggle. Our findings reveal a strong correlation with human evaluations, establishing our method as a viable and effective alternative to traditional metrics and human judgment, particularly for LLM-based chat assistants, where the complexity and diversity of responses challenge existing benchmarks.
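The aggregation step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the judge functions here are hypothetical stubs standing in for calls to actual LLM judges, and the binary `correct`/`incorrect` verdict scheme and majority-vote aggregation are assumptions for the sake of the example.

```python
from collections import Counter
from typing import Callable, List

# A judge maps (question, candidate answer, reference answer) to a verdict string.
Judge = Callable[[str, str, str], str]

def reference_guided_verdict(
    question: str,
    candidate: str,
    reference: str,
    judges: List[Judge],
) -> str:
    """Collect one verdict per judge and return the majority verdict."""
    verdicts = [judge(question, candidate, reference) for judge in judges]
    return Counter(verdicts).most_common(1)[0][0]

# Hypothetical stub judges; in practice each would be an LLM prompted with
# the question, the candidate answer, and the reference answer.
def strict_judge(q: str, c: str, r: str) -> str:
    return "correct" if c.strip().lower() == r.strip().lower() else "incorrect"

def lenient_judge(q: str, c: str, r: str) -> str:
    return "correct" if r.strip().lower() in c.lower() else "incorrect"
```

With several judges of varying strictness, the majority vote smooths out individual judge errors, which is the intuition behind combining multiple LLM judges rather than relying on a single one.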