Large language models are often ranked according to their level of alignment with human preferences: a model is better than other models if its outputs are more frequently preferred by humans. One popular way to elicit human preferences uses pairwise comparisons between the outputs that different models provide for the same inputs. However, since gathering pairwise comparisons from humans is costly and time-consuming, it has become common practice to gather them from a strong large language model, that is, a model strongly aligned with human preferences. Surprisingly, practitioners cannot currently measure the uncertainty that any mismatch between human and model preferences may introduce into the constructed rankings. In this work, we develop a statistical framework to bridge this gap. Given a (small) set of pairwise comparisons by humans and a large set of pairwise comparisons by a model, our framework provides a rank-set (a set of possible ranking positions) for each of the models under comparison. Moreover, it guarantees that, with a probability greater than or equal to a user-specified value, the rank-sets asymptotically cover the true ranking consistent with the distribution of human pairwise preferences. Using pairwise comparisons made by humans on the LMSYS Chatbot Arena platform and pairwise comparisons made by three strong large language models, we empirically demonstrate the effectiveness of our framework and show that the rank-sets constructed using only pairwise comparisons by the strong large language models are often inconsistent with (the distribution of) human pairwise preferences.
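To make the notion of a rank-set concrete, the following minimal Python sketch shows one way such sets could be derived. It is an illustration under stated assumptions, not the procedure of this work: the abstract does not specify how the human and model comparisons are combined, so `corrected_win_rate` assumes a simple bias correction (the model-judged win rate on the large set, shifted by the average human-minus-model gap on the small set), and `rank_sets` assumes that a confidence interval for each model's win probability is already available. Both function names and all numbers are hypothetical. The coverage guarantee described above depends on how the intervals are built, which this sketch does not address.

```python
from typing import Dict, List, Tuple


def corrected_win_rate(model_large: List[float],
                       model_small: List[float],
                       human_small: List[float]) -> float:
    """Assumed bias correction, not taken from the abstract.

    model_large: model-judged outcomes (1 = win, 0 = loss) on the large set.
    model_small / human_small: model and human outcomes on the same small set.
    """
    if len(model_small) != len(human_small):
        raise ValueError("small-set judgments must be paired")
    base = sum(model_large) / len(model_large)
    # Average amount by which human judgments differ from model judgments.
    gap = sum(h - m for h, m in zip(human_small, model_small)) / len(human_small)
    return base + gap


def rank_sets(intervals: Dict[str, Tuple[float, float]]) -> Dict[str, Tuple[int, int]]:
    """Turn per-model (low, high) intervals into (best, worst) rank ranges.

    Rank 1 is best. A model can only be ranked below models whose whole
    interval lies above its own, and above models whose whole interval
    lies below its own.
    """
    k = len(intervals)
    result = {}
    for name, (low, high) in intervals.items():
        surely_better = sum(1 for other, (o_low, _) in intervals.items()
                            if other != name and o_low > high)
        surely_worse = sum(1 for other, (_, o_high) in intervals.items()
                           if other != name and o_high < low)
        result[name] = (1 + surely_better, k - surely_worse)
    return result


if __name__ == "__main__":
    # Made-up intervals for three hypothetical models.
    example = {"A": (0.60, 0.70), "B": (0.55, 0.65), "C": (0.30, 0.40)}
    print(rank_sets(example))
```

In this made-up example the intervals of A and B overlap, so each of them receives the rank-set {1, 2}, while C is separated from both and receives {3}. Wider intervals (more uncertainty) lead to larger rank-sets.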