The impressive performance of large language models (LLMs) has attracted considerable attention from both academia and industry. Beyond the question of how to construct and train LLMs, how to effectively evaluate and compare their capabilities is widely recognized as an important yet difficult problem. Existing paradigms rely on either human annotators or model-based evaluators to assess LLM performance on different tasks. However, these paradigms often suffer from high cost, low generalizability, and inherited biases in practice, which makes them unable to support the sustainable development of LLMs in the long term. To address these issues, and inspired by the peer review systems widely used in academic publishing, we propose a novel framework that automatically evaluates LLMs through a peer-review process. Specifically, to evaluate a given task, we first construct a small qualification exam to select "reviewers" from a set of powerful LLMs. Then, to evaluate the "submissions" written by different candidate LLMs, i.e., the evaluatees, we use the reviewer LLMs to rate or compare the submissions. The final ranking of the evaluatee LLMs is generated from the results provided by all reviewers. We conducted extensive experiments on text summarization tasks with eleven LLMs, including GPT-4. The results demonstrate that bias exists when a single LLM is used as the evaluator. Moreover, our PRE model outperforms all the baselines, illustrating the effectiveness of the peer review mechanism.
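The three stages described above (qualification exam, reviewing, aggregation) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the pass criterion, the rating scale, or the aggregation rule, so the names `qualify` and `rank_evaluatees`, the `tolerance` and `min_accuracy` parameters, and the accuracy-weighted averaging are all assumptions made for illustration. The reviewers here are toy functions standing in for LLM calls.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a reviewer maps a submission text to a numeric rating.
Reviewer = Callable[[str], float]


def qualify(
    candidates: Dict[str, Reviewer],
    exam: List[Tuple[str, float]],
    tolerance: float = 1.0,
    min_accuracy: float = 0.6,
) -> Dict[str, float]:
    """Return {reviewer name: exam accuracy} for candidates that pass the exam.

    Assumed criterion: an exam item is answered correctly when the rating is
    within `tolerance` of the reference rating.
    """
    qualified: Dict[str, float] = {}
    for name, review in candidates.items():
        correct = sum(abs(review(text) - gold) <= tolerance for text, gold in exam)
        accuracy = correct / len(exam)
        if accuracy >= min_accuracy:
            qualified[name] = accuracy
    return qualified


def rank_evaluatees(
    submissions: Dict[str, str],
    candidates: Dict[str, Reviewer],
    weights: Dict[str, float],
) -> List[str]:
    """Rank evaluatees by the weighted mean rating over qualified reviewers.

    Assumed aggregation: each reviewer's rating is weighted by its exam accuracy.
    """
    if not weights:
        raise ValueError("no reviewer passed the qualification exam")
    total = sum(weights.values())
    scores = {
        evaluatee: sum(w * candidates[name](text) for name, w in weights.items()) / total
        for evaluatee, text in submissions.items()
    }
    return sorted(scores, key=lambda evaluatee: scores[evaluatee], reverse=True)


# Toy reviewers standing in for LLM calls; they rate a text by its word count.
candidates: Dict[str, Reviewer] = {
    "strict": lambda text: float(min(5, len(text.split()))),
    "lenient": lambda text: float(min(5, len(text.split()) + 1)),
    "constant": lambda text: 5.0,  # ignores the input, so it should fail the exam
}
exam = [("a b c", 3.0), ("a b c d e", 5.0), ("a", 1.0)]
submissions = {
    "model_a": "one two three four",
    "model_b": "one two",
    "model_c": "one two three",
}

weights = qualify(candidates, exam)
ranking = rank_evaluatees(submissions, candidates, weights)
print(sorted(weights), ranking)
```

In this toy run the `constant` reviewer is filtered out by the exam, and only the two remaining reviewers contribute to the final ranking. A pairwise-comparison variant, which the abstract also mentions, would replace the numeric rating with a preference between two submissions.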