The quality of responses generated by modern large language models (LLMs) is hard to evaluate and compare automatically. Recent studies propose, and predominantly use, LLMs as a reference-free metric for open-ended question answering. More specifically, they take the LLM recognized as the "strongest" as the evaluator, which conducts pairwise comparisons of candidate models' answers and provides a ranking score. However, this intuitive method has several problems, such as self-enhancement bias (the evaluator favoring its own answers) and positional bias. We draw insights and lessons from the educational domain (Cho and MacArthur, 2011; Walsh, 2014) to improve LLM-based evaluations. Specifically, we propose (1) the peer rank (PR) algorithm, which takes into account each peer LLM's pairwise preferences over all answer pairs and outputs a final ranking of models; and (2) peer discussion (PD), where we prompt two LLMs to discuss and try to reach a mutual agreement on their preference between two answers. We conduct experiments on two benchmark datasets. We find that our approaches achieve higher accuracy and better alignment with human judgments, respectively. Interestingly, PR can induce a relatively accurate self-ranking of models under the anonymous setting, where each model's name is unrevealed. Our work provides space to explore evaluating models that are hard for humans to compare.
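The peer rank idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes one plausible reading, in which each reviewer's vote is weighted by that reviewer's own current score and the scores are recomputed until they stop changing. The function name `peer_rank`, the layout of `prefs`, and the toy data are all hypothetical.

```python
def peer_rank(models, prefs, iters=100, tol=1e-9):
    """Rank models from peer pairwise preferences (illustrative sketch only).

    prefs[r][(a, b)] is reviewer r's preference for a's answer over b's:
    1.0 = prefers a, 0.0 = prefers b, 0.5 = tie. Each unordered pair needs
    to be stored in one orientation only.
    """
    n = len(models)
    # Assumption: every reviewer starts with the same voting weight.
    weights = {m: 1.0 / n for m in models}
    scores = {}
    for _ in range(iters):
        scores = {}
        for m in models:
            wins = 0.0
            for o in models:
                if o == m:
                    continue
                for r in models:
                    if (m, o) in prefs[r]:
                        p = prefs[r][(m, o)]
                    else:
                        p = 1.0 - prefs[r][(o, m)]
                    # A reviewer's vote counts in proportion to its weight.
                    wins += weights[r] * p
            # Weighted win rate of m, averaged over its opponents.
            scores[m] = wins / (n - 1)
        total = sum(scores.values())
        new_weights = {m: scores[m] / total for m in models}
        delta = max(abs(new_weights[m] - weights[m]) for m in models)
        weights = new_weights
        if delta < tol:
            break
    ranking = sorted(models, key=lambda m: scores[m], reverse=True)
    return ranking, scores


# Toy data: reviewers A and B agree on A > B > C, while reviewer C
# favors its own answers (a caricature of self-enhancement bias).
toy_models = ["A", "B", "C"]
toy_prefs = {
    "A": {("A", "B"): 1.0, ("A", "C"): 1.0, ("B", "C"): 1.0},
    "B": {("A", "B"): 1.0, ("A", "C"): 1.0, ("B", "C"): 1.0},
    "C": {("A", "B"): 1.0, ("A", "C"): 0.0, ("B", "C"): 0.0},
}
toy_ranking, toy_scores = peer_rank(toy_models, toy_prefs)
print(toy_ranking, toy_scores)
```

In this toy setup the self-favoring reviewer loses voting weight on each iteration because its peers rank it low, which suggests how weighting by peer-assessed quality could dampen self-enhancement bias. Positional bias is not modeled here.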