We propose LLM-PeerReview, an unsupervised LLM ensemble method that selects the best response from multiple LLM-generated candidates for each query, harnessing the collective wisdom of multiple models with diverse strengths. LLM-PeerReview is built on a novel, peer-review-inspired framework that offers a clear and interpretable mechanism while remaining fully unsupervised for flexible adaptability and generalization. Specifically, it operates in three stages: for scoring, we use the emerging LLM-as-a-Judge technique to evaluate each response by reusing the multiple LLMs at hand; for reasoning, we aggregate the multiple scores into a final score for each response, either with a principled graphical-model-based truth inference algorithm or with a straightforward averaging strategy; finally, the highest-scoring response is selected as the ensemble output. LLM-PeerReview is conceptually simple and empirically powerful: the two variants of the proposed approach obtain strong results across four datasets, outperforming the recent advanced method Smoothie-Global by 6.9 and 7.3 percentage points, respectively.
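To make the three-stage pipeline concrete, here is a minimal sketch of the averaging variant in Python. It is illustrative only: `judge_score` is a hypothetical stand-in for an actual LLM-as-a-Judge API call, and the graphical-model truth-inference variant is not shown.

```python
from statistics import mean

def judge_score(judge_model: str, query: str, response: str) -> float:
    """Hypothetical LLM-as-a-Judge call: ask `judge_model` to rate
    `response` to `query` and parse out a numeric score (e.g., 1-10)."""
    raise NotImplementedError("replace with a real LLM API call")

def peer_review_select(judges: list[str], query: str,
                       candidates: list[str]) -> str:
    """Select the best candidate via the averaging variant."""
    # Stage 1 (scoring): every judge model scores every candidate.
    scores = [[judge_score(j, query, c) for j in judges]
              for c in candidates]
    # Stage 2 (reasoning): aggregate each candidate's scores by averaging.
    final = [mean(row) for row in scores]
    # Stage 3 (selection): return the highest-scoring candidate.
    best = max(range(len(candidates)), key=final.__getitem__)
    return candidates[best]
```

The same skeleton accommodates the second variant by replacing the per-candidate averaging in stage 2 with a truth-inference step over the judge-by-candidate score matrix.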