Evaluating the conversational abilities of large language models (LLMs) remains a challenging task. Current mainstream approaches primarily rely on the "LLM-as-a-judge" paradigm, in which an LLM is prompted to serve as an evaluator and assess dialogue quality. However, such methods often suffer from various biases, which undermine the reliability and consistency of the evaluation results. To mitigate these biases, recent methods employ multiple LLMs as judges and aggregate their judgments to select the optimal assessment. Although effective, this multi-judge approach incurs significant computational overhead during inference. In this paper, we propose an efficient dialogue evaluator that captures the collective wisdom of multiple LLM judges by aggregating their preference knowledge into a single model. Our approach preserves the advantages of diverse multi-judge feedback while drastically reducing the evaluation cost, enabling fast, flexible, and fine-grained dialogue quality assessment. Extensive experiments on seven single-rating and pairwise-comparison dialogue evaluation benchmarks demonstrate that our method outperforms existing baselines across diverse scenarios, showcasing its efficiency and robustness.