Large language models (LLMs) exhibit positional bias in how they use context, which especially complicates listwise ranking. To address this, we propose permutation self-consistency, a form of self-consistency over ranking list outputs of black-box LLMs. Our key idea is to marginalize out different list orders in the prompt to produce an order-independent ranking with less positional bias. First, given some input prompt, we repeatedly shuffle the list in the prompt and pass it through the LLM while holding the instructions the same. Next, we aggregate the resulting sample of rankings by computing the central ranking closest in distance to all of them, marginalizing out prompt order biases in the process. Theoretically, we prove the robustness of our method, showing convergence to the true ranking in the presence of random perturbations. Empirically, on five list-ranking datasets in sorting and passage reranking, our approach improves scores from conventional inference by up to 7-18% for GPT-3.5 and 8-16% for LLaMA v2 (70B), surpassing the previous state of the art in passage reranking. Our code is at https://github.com/castorini/perm-sc.
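The shuffle-then-aggregate procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation (see the linked repository for that). It rests on assumptions not stated in the abstract: the distance between rankings is taken to be the Kendall tau distance, the central ranking is found by brute force over all permutations (feasible only for short lists), and `first_item_biased_ranker` is a hypothetical stand-in for a black-box LLM whose output depends on the order of items in the prompt. All function names are illustrative.

```python
import itertools
import random
from typing import Callable, List, Sequence

Ranking = List[str]


def kendall_tau_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Count the item pairs that rankings a and b order differently."""
    pos_b = {item: i for i, item in enumerate(b)}
    return sum(
        1
        for x, y in itertools.combinations(a, 2)  # x precedes y in a
        if pos_b[x] > pos_b[y]                    # but follows y in b
    )


def central_ranking(samples: Sequence[Sequence[str]]) -> Ranking:
    """Return the ranking with the smallest total distance to all samples.

    Brute force over every permutation: only practical for short lists.
    """
    items = sorted(samples[0])
    best = min(
        itertools.permutations(items),
        key=lambda cand: sum(kendall_tau_distance(cand, s) for s in samples),
    )
    return list(best)


def permutation_self_consistency(
    items: Sequence[str],
    rank_fn: Callable[[List[str]], Ranking],
    num_samples: int = 20,
    seed: int = 0,
) -> Ranking:
    """Shuffle the input list, rank each shuffle, and aggregate the outputs."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        shuffled = list(items)
        rng.shuffle(shuffled)              # new prompt order, same instructions
        samples.append(rank_fn(shuffled))  # one black-box ranking call
    return central_ranking(samples)


def first_item_biased_ranker(prompt_items: List[str]) -> Ranking:
    """Hypothetical ranker: sorts correctly but always keeps whatever item
    appeared first in the prompt at the top (a toy positional bias)."""
    return [prompt_items[0]] + sorted(prompt_items[1:])


if __name__ == "__main__":
    items = ["d", "b", "e", "a", "c"]
    print("single biased call:", first_item_biased_ranker(list(items)))
    print("aggregated ranking:", permutation_self_consistency(
        items, first_item_biased_ranker))
```

In this toy setting each individual call can misplace one item, but the misplaced item changes with the prompt order, so aggregating over shuffles tends to cancel the bias. For longer lists, the brute-force search would need to be replaced with a more scalable aggregation method.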