LLM-as-a-judge approaches are a practical and effective way of assessing a range of text tasks. However, when pairwise comparisons are used to rank a set of candidates, the computational cost scales quadratically with the number of candidates, which limits their practicality. This paper introduces a Product of Experts (PoE) framework for efficient LLM comparative assessment. Here, individual comparisons are treated as experts that provide information on a pair's score difference. The PoE framework combines the information from these experts to yield an expression that can be maximized with respect to the underlying set of candidates, and it is highly flexible: any form of expert can be assumed. When Gaussian experts are used, one can derive simple closed-form solutions for the optimal candidate ranking, as well as expressions for selecting which comparisons should be made to maximize the probability of this ranking. Our approach enables efficient comparative assessment, whereby score predictions that correlate well with human judgements can be generated using only a small subset of the possible comparisons. We evaluate the approach on multiple NLG tasks and demonstrate that our framework can yield considerable computational savings when performing pairwise comparative assessment. With many candidate texts, the PoE solution can achieve performance similar to using all comparisons while using as few as 2% of them.
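The Gaussian-expert case described above can be sketched in code. A minimal illustration, under assumed simplifications: each comparison is taken to yield an estimated score difference with unit variance, so maximizing the product of Gaussian experts reduces to a linear least-squares problem in the candidate scores. The function name and data are hypothetical, not from the paper itself.

```python
import numpy as np

def poe_gaussian_scores(n_candidates, comparisons):
    """Closed-form MAP candidate scores under a Gaussian-expert PoE model.

    Each comparison (i, j, d) acts as an expert asserting that the score
    difference s_i - s_j is Gaussian-distributed around the observed d.
    With equal (unit) variances, maximizing the product of these Gaussians
    is a least-squares problem; scores are only defined up to an additive
    constant, so we centre them at zero.
    """
    rows, diffs = [], []
    for i, j, d in comparisons:
        w = np.zeros(n_candidates)
        w[i], w[j] = 1.0, -1.0   # row encodes s_i - s_j
        rows.append(w)
        diffs.append(d)
    W = np.array(rows)
    d = np.array(diffs)
    s, *_ = np.linalg.lstsq(W, d, rcond=None)
    return s - s.mean()  # fix the additive gauge freedom

# Toy example: 4 candidates, observed via only 3 of the 6 possible
# pairwise comparisons (a chain of unit score differences).
comparisons = [(1, 0, 1.0), (2, 1, 1.0), (3, 2, 1.0)]
scores = poe_gaussian_scores(4, comparisons)
ranking = np.argsort(-scores)  # best candidate first
```

Even with an incomplete set of comparisons, the least-squares solution propagates information through shared candidates, which is what allows the framework to rank well from a small fraction of all pairs.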