We propose LLM-PeerReview, an unsupervised LLM ensemble method that selects the best response from multiple LLM-generated candidates for each query, harnessing the collective wisdom of models with diverse strengths. LLM-PeerReview is built on a novel, peer-review-inspired framework that offers a transparent and interpretable mechanism while remaining fully unsupervised for flexible adaptability and generalization. Specifically, it operates in three stages: for scoring, we use the emerging LLM-as-a-Judge technique to evaluate each response by reusing the multiple LLMs at hand; for reasoning, we aggregate the multiple scores for each response into a final score, using either a straightforward averaging strategy or a principled graphical-model-based truth-inference algorithm; finally, the highest-scoring response is selected as the ensemble output. LLM-PeerReview is conceptually simple and empirically powerful. Our results on four datasets spanning diverse task types, including factual-recall QA, math reasoning, and instruction following, show that the two variants of the proposed approach outperform the advanced Smoothie-Global model by 6.9 and 7.3 percentage points, respectively.
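To make the three-stage pipeline concrete, below is a minimal Python sketch of the averaging variant. It assumes a hypothetical `judges` list of callables, each wrapping one LLM used as a judge and returning a numeric score for a (query, response) pair; this interface is illustrative, not the paper's actual API, and the truth-inference variant would swap the averaging step for a graphical-model aggregator.

```python
# Minimal sketch of the LLM-PeerReview pipeline (averaging variant).
# `judges` is a hypothetical list of callables; judge(query, response)
# is assumed to return a numeric quality score.

from statistics import mean

def peer_review_select(query: str, candidates: list[str], judges) -> str:
    """Return the candidate response with the highest aggregated score."""
    # Stage 1 (scoring): every judge LLM scores every candidate response.
    score_matrix = [[judge(query, resp) for judge in judges]
                    for resp in candidates]
    # Stage 2 (reasoning): aggregate each response's per-judge scores;
    # here we use the straightforward averaging strategy.
    final_scores = [mean(row) for row in score_matrix]
    # Stage 3 (selection): the highest-scoring response is the ensemble output.
    best_idx = max(range(len(candidates)), key=final_scores.__getitem__)
    return candidates[best_idx]
```

The same responder models double as judges ("reusing the multiple LLMs at hand"), so no additional supervision or trained reward model is required.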