Large language models are often ranked according to their level of alignment with human preferences -- a model is better than other models if its outputs are more frequently preferred by humans. One of the popular ways to elicit human preferences utilizes pairwise comparisons between the outputs provided by different models to the same inputs. However, since gathering pairwise comparisons from humans is costly and time-consuming, it has become common practice to gather pairwise comparisons from a strong large language model -- a model strongly aligned with human preferences. Surprisingly, practitioners currently have no way to measure the uncertainty that any mismatch between human and model preferences may introduce into the constructed rankings. In this work, we develop a statistical framework to bridge this gap. Given a (small) set of pairwise comparisons by humans and a large set of pairwise comparisons by a model, our framework provides a rank-set -- a set of possible ranking positions -- for each of the models under comparison. Moreover, it guarantees that, asymptotically, the rank-sets cover the true ranking consistent with the distribution of human pairwise preferences with probability greater than or equal to a user-specified value. Using pairwise comparisons made by humans on the LMSYS Chatbot Arena platform and pairwise comparisons made by three strong large language models, we empirically demonstrate the effectiveness of our framework and show that the rank-sets constructed using only pairwise comparisons by the strong large language models are often inconsistent with (the distribution of) human pairwise preferences.
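To make the notion of a rank-set concrete, the following is a minimal illustrative sketch, not the paper's actual estimator: it assumes each model is summarized by an empirical win rate with a simultaneous (Bonferroni-corrected) normal-approximation confidence interval, and assigns each model the set of ranking positions consistent with those intervals. All function and variable names here are hypothetical.

```python
import math
from statistics import NormalDist


def rank_sets(win_rates, counts, alpha=0.05):
    """Illustrative rank-set construction (assumption: win rates summarize models).

    win_rates: dict model -> empirical pairwise win rate in [0, 1]
    counts:    dict model -> number of comparisons behind that rate
    Returns dict model -> (best_rank, worst_rank), i.e., a contiguous rank-set.
    """
    models = list(win_rates)
    m = len(models)
    # Bonferroni correction so all m intervals hold jointly with prob >= 1 - alpha.
    z = NormalDist().inv_cdf(1 - alpha / (2 * m))
    ci = {}
    for k in models:
        p, n = win_rates[k], counts[k]
        half = z * math.sqrt(p * (1 - p) / n)  # normal-approximation half-width
        ci[k] = (p - half, p + half)
    sets = {}
    for k in models:
        lo, hi = ci[k]
        # Models whose interval lies entirely above k's are confidently better;
        # models entirely below are confidently worse. Overlaps stay ambiguous.
        surely_better = sum(1 for j in models if j != k and ci[j][0] > hi)
        surely_worse = sum(1 for j in models if j != k and ci[j][1] < lo)
        sets[k] = (1 + surely_better, m - surely_worse)
    return sets
```

With well-separated win rates every rank-set collapses to a single position; when intervals overlap, a model's rank-set widens to reflect the ranking uncertainty, which is exactly the quantity the abstract argues current model-judged leaderboards fail to report.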