The quality of responses generated by modern large language models (LLMs) is hard to evaluate and compare automatically. Recent studies therefore rely predominantly on LLMs themselves for reference-free evaluation of open-ended question answering. More specifically, they use the LLM widely regarded as "strongest" as the evaluator, which conducts pairwise comparisons of candidate models' answers and provides a ranking score. However, this intuitive method has multiple problems, such as self-enhancement bias (favoring its own answers) and positional bias. We draw insights and lessons from the educational domain (Cho & MacArthur, 2011; Walsh, 2014) to improve LLM-based evaluations. Specifically, we propose (1) the peer rank (PR) algorithm, which takes into account each peer LLM's pairwise preferences over all answer pairs and outputs a final ranking of models; and (2) peer discussion (PD), where we prompt two LLMs to discuss and try to reach a mutual agreement on their preferences between two answers. We conduct experiments on two benchmark datasets and find that our approaches achieve higher accuracy and align better with human judgments. Interestingly, PR can induce a relatively accurate self-ranking of models under the anonymous setting, where each model's name is withheld. Our work opens up space for evaluating models whose outputs are hard for humans to compare.
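The peer rank idea, aggregating every peer LLM's pairwise preferences into a final ranking, can be illustrated with a minimal sketch. The version below is an assumption-laden toy implementation, not the paper's exact algorithm: it iteratively reweights each reviewer's votes by that reviewer's own current score, so that models judged stronger by their peers also count more as evaluators. The data layout (`prefs` keyed by reviewer and answer pair) is hypothetical.

```python
# Hypothetical sketch of a peer-rank-style aggregation (not the paper's
# exact algorithm). Each model acts both as a candidate and as a reviewer;
# a reviewer's vote is weighted by its own current score, and the weights
# are iterated to a fixed point.

def peer_rank(models, prefs, iters=10):
    """models: list of model names.
    prefs[(reviewer, a, b)] = 1.0 if reviewer prefers answer a over b,
    0.5 for a tie, 0.0 if it prefers b. Returns {model: final score}."""
    # Start with uniform reviewer weights.
    weights = {m: 1.0 / len(models) for m in models}
    for _ in range(iters):
        raw = {m: 0.0 for m in models}
        for (reviewer, a, b), pref in prefs.items():
            # The reviewer's weighted vote: `pref` mass goes to a,
            # the remaining (1 - pref) mass goes to b.
            raw[a] += weights[reviewer] * pref
            raw[b] += weights[reviewer] * (1.0 - pref)
        # Normalize so scores sum to 1 and feed back as reviewer weights.
        total = sum(raw.values())
        weights = {m: raw[m] / total for m in models}
    return weights
```

For example, if every reviewer prefers model A over B and B over C, the fixed point ranks A first and drives C's score toward zero; reviewers that the group itself rates poorly end up with little influence on the final ranking.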