Large language models are often ranked by how well they align with human preferences: a model is considered better than another if its outputs are more frequently preferred by humans. One of the most popular ways to elicit human preferences uses pairwise comparisons between the outputs that different models produce for the same inputs. However, since gathering pairwise comparisons from humans is costly and time-consuming, it has become common practice to gather pairwise comparisons from a strong large language model, that is, a model strongly aligned with human preferences. Surprisingly, practitioners currently have no way to measure the uncertainty that any mismatch between human and model preferences may introduce into the constructed rankings. In this work, we develop a statistical framework to bridge this gap. Given a small set of pairwise comparisons by humans and a large set of pairwise comparisons by a model, our framework provides a rank-set (a set of possible ranking positions) for each of the models under comparison. Moreover, it guarantees that, with probability at least a user-specified value, the rank-sets cover the true ranking consistent with (the distribution of) human pairwise preferences. Our framework is computationally efficient, easy to use, and makes no assumptions about the distribution of human preferences or about the degree of alignment between the pairwise comparisons by humans and by the strong large language model.
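To make the notion of a rank-set concrete, the following is a minimal Python sketch and not the framework's own procedure. It assumes that a confidence interval for each model's probability of being preferred by humans has already been obtained, and that these intervals hold simultaneously with the user-specified probability. How the small human set and the large model-judged set are combined to obtain such intervals is not shown. The function name `rank_sets` and the example numbers are hypothetical.

```python
from typing import List, Tuple


def rank_sets(lower: List[float], upper: List[float]) -> List[Tuple[int, int]]:
    """Turn per-model confidence intervals into rank-sets.

    lower[i], upper[i] bound model i's (unknown) probability of being
    preferred by humans; rank 1 is the best position.
    Assumption: all intervals cover the true values simultaneously.
    """
    k = len(lower)
    result = []
    for i in range(k):
        # Models whose whole interval lies above model i's interval
        # must be ranked ahead of model i.
        surely_better = sum(1 for j in range(k) if j != i and lower[j] > upper[i])
        # Models whose whole interval lies below model i's interval
        # must be ranked behind model i.
        surely_worse = sum(1 for j in range(k) if j != i and upper[j] < lower[i])
        best_rank = 1 + surely_better
        worst_rank = k - surely_worse
        result.append((best_rank, worst_rank))
    return result


# Hypothetical intervals for four models (illustrative numbers only).
lower = [0.60, 0.45, 0.40, 0.10]
upper = [0.75, 0.58, 0.55, 0.30]
print(rank_sets(lower, upper))
```

Each pair is the range of positions a model may occupy. Overlapping intervals (the second and third models here) produce rank-sets with more than one position, which is how uncertainty about the preferences shows up in the ranking. If the intervals jointly cover the true values with a given probability, the resulting rank-sets cover the true ranking with at least that probability.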