The impressive performance of large language models (LLMs) has attracted considerable attention from both academia and industry. Beyond the question of how to construct and train LLMs, how to effectively evaluate and compare their capabilities is widely recognized as an important yet difficult problem. Existing paradigms rely on either human annotators or model-based evaluators to assess LLM performance on different tasks. However, these paradigms often suffer from high cost, low generalizability, and inherited biases in practice, which makes them ill-suited to supporting the sustainable development of LLMs in the long term. To address these issues, and inspired by the peer-review systems widely used in academic publishing, we propose a novel framework that automatically evaluates LLMs through a peer-review process. Specifically, to evaluate a given task, we first construct a small qualification exam to select "reviewers" from several powerful LLMs. Then, to evaluate the "submissions" written by different candidate LLMs, i.e., the evaluatees, we use the reviewer LLMs to rate or compare the submissions. The final ranking of the evaluatee LLMs is generated from the results provided by all reviewers. We conducted extensive experiments on text summarization tasks with eleven LLMs, including GPT-4. The results demonstrate the existence of bias when evaluation relies on a single LLM. Moreover, our PRE model outperforms all the baselines, illustrating the effectiveness of the peer-review mechanism.
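The three stages described above (qualification exam, reviewing, aggregation) can be illustrated with a minimal sketch. This is not the implementation used in the paper: the qualification rule (agreement with reference ratings within a tolerance), the unweighted mean used for aggregation, and all names such as `select_reviewers`, `peer_review_ranking`, and the toy judges are assumptions made purely for illustration. Real reviewers would be LLM calls rather than the simple functions used here.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

# Hypothetical type: a reviewer maps a submission text to a numeric rating.
Reviewer = Callable[[str], float]


def select_reviewers(
    candidates: Dict[str, Reviewer],
    exam: List[Tuple[str, float]],
    tolerance: float = 1.0,
    min_accuracy: float = 0.6,
) -> List[str]:
    """Qualification exam (assumed rule): keep candidates whose ratings fall
    within `tolerance` of the reference rating on at least `min_accuracy`
    of the exam items."""
    if not exam:
        raise ValueError("the qualification exam must contain at least one item")
    qualified = []
    for name, rate in candidates.items():
        hits = sum(abs(rate(text) - gold) <= tolerance for text, gold in exam)
        if hits / len(exam) >= min_accuracy:
            qualified.append(name)
    return qualified


def peer_review_ranking(
    submissions: Dict[str, str],
    candidates: Dict[str, Reviewer],
    reviewer_names: List[str],
) -> List[str]:
    """Rate every submission with every qualified reviewer and rank the
    evaluatees by mean rating (assumed aggregation: unweighted mean)."""
    scores = {
        evaluatee: mean(candidates[r](text) for r in reviewer_names)
        for evaluatee, text in submissions.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy stand-ins for reviewer LLMs (assumptions, for demonstration only).
candidates: Dict[str, Reviewer] = {
    "length_judge": lambda t: float(min(len(t.split()), 5)),
    "lenient_judge": lambda t: float(min(len(t.split()) + 1, 5)),
    "constant_judge": lambda t: 3.0,
}

# Exam items paired with reference ratings (e.g. from human annotators).
exam = [("a b c d e", 5.0), ("a", 1.0), ("a b c", 3.0)]

# One submission per evaluatee model.
submissions = {
    "model_a": "one two",
    "model_b": "one two three four",
    "model_c": "one",
}

reviewers = select_reviewers(candidates, exam)
ranking = peer_review_ranking(submissions, candidates, reviewers)
print(reviewers)
print(ranking)
```

In this toy setup the judge that ignores the submission fails the exam, so it has no influence on the final ranking. That is the role the qualification stage plays in the framework; a weighted aggregation or pairwise comparison could be substituted for the unweighted mean.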