The quality of responses generated by modern large language models (LLMs) is hard to evaluate and compare automatically. Recent studies therefore rely predominantly on LLMs themselves for reference-free evaluation of open-ended question answering. More specifically, they use the LLM widely regarded as "strongest" as the evaluator, which conducts pairwise comparisons of candidate models' answers and provides a ranking score. However, this intuitive method has multiple problems, such as self-enhancement bias (favoring its own answers) and positional bias. We draw insights and lessons from the educational domain (Cho & MacArthur, 2011; Walsh, 2014) to improve LLM-based evaluations. Specifically, we propose (1) the peer rank (PR) algorithm, which takes into account each peer LLM's pairwise preferences over all answer pairs and outputs a final ranking of models; and (2) peer discussion (PD), where we prompt two LLMs to discuss and try to reach a mutual agreement on their preferences between two answers. We conduct experiments on two benchmark datasets and find that our approaches achieve higher accuracy and align better with human judgments. Interestingly, PR can induce a relatively accurate self-ranking of models under the anonymous setting, where each model's name is withheld. Our work opens up space for evaluating models whose outputs are hard for humans to compare.
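The peer rank idea, aggregating every peer LLM's pairwise preferences into a final ranking, can be illustrated with a minimal sketch. The version below is an assumption-laden toy implementation, not the paper's exact algorithm: it iteratively reweights each reviewer's votes by that reviewer's own current score, so that models judged stronger by their peers also count more as evaluators. The data layout (`prefs` keyed by reviewer and answer pair) is hypothetical.

```python
# Hypothetical sketch of a peer-rank-style aggregation (not the paper's
# exact algorithm). Each model acts both as a candidate and as a reviewer;
# a reviewer's vote is weighted by its own current score, and the weights
# are iterated to a fixed point.

def peer_rank(models, prefs, iters=10):
    """models: list of model names.
    prefs[(reviewer, a, b)] = 1.0 if reviewer prefers answer a over b,
    0.5 for a tie, 0.0 if it prefers b. Returns {model: final score}."""
    # Start with uniform reviewer weights.
    weights = {m: 1.0 / len(models) for m in models}
    for _ in range(iters):
        raw = {m: 0.0 for m in models}
        for (reviewer, a, b), pref in prefs.items():
            # The reviewer's weighted vote: `pref` mass goes to a,
            # the remaining (1 - pref) mass goes to b.
            raw[a] += weights[reviewer] * pref
            raw[b] += weights[reviewer] * (1.0 - pref)
        # Normalize so scores sum to 1 and feed back as reviewer weights.
        total = sum(raw.values())
        weights = {m: raw[m] / total for m in models}
    return weights
```

For example, if every reviewer prefers model A over B and B over C, the fixed point ranks A first and drives C's score toward zero; reviewers that the group itself rates poorly end up with little influence on the final ranking.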