Human evaluation of generated language through pairwise preference judgments is pervasive. However, under common scenarios, such as when generations from a model pair are very similar or when stochastic decoding causes large variation across generations, these judgments produce inconsistent preference ratings. We address these challenges by introducing a meta-evaluation measure, separability, which estimates how suitable a test instance is for pairwise preference evaluation. For a candidate test instance, separability samples multiple generations from a pair of models and measures how distinguishable the two sets of generations are. Our experiments show that instances with high separability values yield more consistent preference ratings from both human- and auto-raters. Further, the distribution of separability offers insight into which test benchmarks are more valuable for comparing models. Finally, we incorporate separability into Elo ratings, accounting for how suitable each test instance might be for reliably ranking LLMs. Overall, separability has implications for consistent, efficient, and robust preference evaluation of LLMs with both human- and auto-raters.
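To make the idea concrete, below is a minimal sketch of one way a separability-style score could be computed for a single test instance. The `similarity` callable, the sample sizes, and the within-minus-cross formulation are illustrative assumptions, not necessarily the paper's exact definition.

```python
# Hypothetical sketch: score how distinguishable two models' sampled
# generations are on one test instance. The `similarity` function
# (e.g., a lexical or embedding-based score in [0, 1]) is an assumed
# placeholder; the paper's definition of separability may differ.
from itertools import combinations, product
from statistics import mean
from typing import Callable


def separability(gens_a: list[str], gens_b: list[str],
                 similarity: Callable[[str, str], float]) -> float:
    """Return a distinguishability score for two sets of generations.

    gens_a, gens_b: multiple stochastic samples (>= 2 each) from models A and B.
    Higher values mean the two models' outputs are easier to tell apart
    on this instance, so pairwise preference ratings should be more consistent.
    """
    # Average similarity across generations from the two different models.
    cross = mean(similarity(a, b) for a, b in product(gens_a, gens_b))
    # Average similarity among generations from the same model.
    within_a = mean(similarity(x, y) for x, y in combinations(gens_a, 2))
    within_b = mean(similarity(x, y) for x, y in combinations(gens_b, 2))
    within = (within_a + within_b) / 2
    # High within-model agreement combined with low cross-model agreement
    # indicates a highly separable instance.
    return within - cross
```

Under this sketch, an instance where each model's samples cluster tightly but the two clusters differ scores high, while an instance where stochastic decoding scatters both models' outputs, or where the two models produce near-identical text, scores low.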