PRE: A Peer Review Based Large Language Model Evaluator

The impressive performance of large language models (LLMs) has attracted considerable attention from the academic and industrial communities. Besides how to construct and train LLMs, how to effectively evaluate and compare the capacity of LLMs has also been well recognized as an important yet difficult problem. Existing paradigms rely on either human annotators or model-based evaluators to evaluate the performance of LLMs on different tasks. However, these paradigms often suffer from high cost, low generalizability, and inherited biases in practice, which make them incapable of supporting the sustainable development of LLMs in long term. In order to address these issues, inspired by the peer review systems widely used in academic publication process, we propose a novel framework that can automatically evaluate LLMs through a peer-review process. Specifically, for the evaluation of a specific task, we first construct a small qualification exam to select "reviewers" from a couple of powerful LLMs. Then, to actually evaluate the "submissions" written by different candidate LLMs, i.e., the evaluatees, we use the reviewer LLMs to rate or compare the submissions. The final ranking of evaluatee LLMs is generated based on the results provided by all reviewers. We conducted extensive experiments on text summarization tasks with eleven LLMs including GPT-4. The results demonstrate the existence of biasness when evaluating using a single LLM. Also, our PRE model outperforms all the baselines, illustrating the effectiveness of the peer review mechanism.

翻译：大语言模型（LLMs）的卓越性能已引起学术界和工业界的广泛关注。除了如何构建和训练大语言模型外，如何有效评估和比较其能力也被公认为重要但困难的问题。现有范式依赖人工标注员或基于模型的评估器来评价大语言模型在不同任务上的性能。然而，这些范式在实践中常面临成本高、泛化性低及固有偏见等问题，使其难以长期支撑大语言模型的可持续发展。为解决这些问题，受学术出版过程中广泛采用的同行评审系统启发，本文提出了一种可通过同行评审流程自动评估大语言模型的新框架。具体而言，针对特定任务的评估，我们首先构建小型资格测试，从多个强大LLMs中筛选"评审者"。接着，为实际评估不同候选大语言模型（即被评对象）生成的"提交内容"，我们利用评审LLMs对提交内容进行评分或比较。最终，基于所有评审者提供的结果生成被评LLMs的排名。我们在包含GPT-4在内的11个大语言模型的文本摘要任务上进行了广泛实验。结果表明，使用单一LLM进行评估存在偏见，而我们的PRE模型在所有基线方法中表现最优，验证了同行评审机制的有效性。

相关内容

大语言模型

关注 66

大语言模型是基于海量文本数据训练的深度学习模型。它不仅能够生成自然语言文本，还能够深入理解文本含义，处理各种自然语言任务，如文本摘要、问答、翻译等。2023年，大语言模型及其在人工智能领域的应用已成为全球科技研究的热点，其在规模上的增长尤为引人注目，参数量已从最初的十几亿跃升到如今的一万亿。参数量的提升使得模型能够更加精细地捕捉人类语言微妙之处，更加深入地理解人类语言的复杂性。在过去的一年里，大语言模型在吸纳新知识、分解复杂任务以及图文对齐等多方面都有显著提升。随着技术的不断成熟，它将不断拓展其应用范围，为人类提供更加智能化和个性化的服务，进一步改善人们的生活和生产方式。

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

专知会员服务

46+阅读 · 2021年11月24日

UCM《机器学习导论笔记》，80页pdf CSE176 Introduction to Machine Learning

专知会员服务

32+阅读 · 2021年9月29日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日