LLM-as-a-judge approaches are a practical and effective way of assessing a range of text tasks, and they align well with human judgements, especially when applied in a comparative fashion. However, when pairwise comparisons are used to rank a set of candidates, the computational cost scales quadratically with the number of candidates, which limits practical applicability. This paper introduces a Product of Experts (PoE) framework for efficient LLM comparative assessment. Here, each individual comparison is treated as an expert that provides information about a pair's score difference. The PoE framework combines the information from these experts into an expression that can be maximized with respect to the underlying candidate scores, and it is highly flexible, since any form of expert can be assumed. With Gaussian experts, one can derive simple closed-form solutions for the optimal candidate ranking, as well as expressions for selecting which comparisons should be made to maximize the probability of that ranking. Our approach enables efficient comparative assessment: using only a small subset of the possible comparisons, one can generate score predictions that correlate with human judgements as well as predictions obtained when all comparisons are used. We evaluate the approach on multiple NLG tasks and demonstrate that the framework yields considerable computational savings for pairwise comparative assessment. When the number of candidates N is large, the PoE solution achieves performance similar to using all comparisons with as few as 2% of them.
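As a rough illustration of the Gaussian-expert case described above, the sketch below shows how pairwise score-difference estimates can be combined into a single objective whose maximizer is obtained in closed form as a linear least-squares solution. This is a minimal sketch under assumed inputs, not the paper's exact formulation: the function `poe_gaussian_scores`, the `(i, j, d, var)` comparison format, and the mapping from LLM judgements to difference estimates and variances are all illustrative assumptions.

```python
import numpy as np

def poe_gaussian_scores(n_candidates, comparisons):
    """Combine Gaussian 'experts' (one per pairwise comparison) and return the
    closed-form maximiser of the resulting PoE objective over candidate scores.

    Each comparison is a hypothetical tuple (i, j, d, var): an expert's estimate d
    of the score difference s_i - s_j, with variance var. Under Gaussian experts
    the PoE log-probability is, up to a constant,
        -0.5 * sum_k (s_i - s_j - d_k)^2 / var_k,
    which is maximised by solving a weighted linear least-squares system.
    """
    A = np.zeros((len(comparisons), n_candidates))
    b = np.zeros(len(comparisons))
    for k, (i, j, d, var) in enumerate(comparisons):
        w = 1.0 / np.sqrt(var)          # weight each expert by its precision
        A[k, i] = w
        A[k, j] = -w
        b[k] = w * d
    # Scores are only identified up to an additive shift; lstsq returns the
    # minimum-norm solution, so only relative scores (the ranking) matter.
    scores, *_ = np.linalg.lstsq(A, b, rcond=None)
    return scores

# Toy usage: 4 candidates, using only a sparse subset of the 6 possible pairs.
comparisons = [
    (0, 1, 0.8, 1.0),   # candidate 0 estimated ~0.8 better than candidate 1
    (1, 2, 0.3, 1.0),
    (2, 3, 0.5, 1.0),
]
print(poe_gaussian_scores(4, comparisons))
```

The key point this illustrates is that, once each comparison is modelled as a Gaussian expert on a score difference, ranking a candidate set does not require all N(N-1)/2 comparisons: any connected subset yields a well-posed least-squares problem and hence a closed-form score estimate.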