We provide a systematic understanding of how specific components and wordings used in prompts affect the effectiveness of rankers based on zero-shot Large Language Models (LLMs). Several zero-shot LLM-based ranking methods have recently been proposed. Among other aspects, these methods differ in (1) the ranking algorithm they implement, e.g., pointwise vs. listwise; (2) the backbone LLM used, e.g., GPT3.5 vs. FLAN-T5; and (3) the components and wording used in prompts, e.g., whether role definition (role-playing) is used and the actual words chosen to express it. It is currently unclear whether performance differences stem from the underlying ranking algorithm or from spurious factors such as a better choice of words in prompts. This confusion risks undermining future research. Through large-scale experimentation and analysis, we find that ranking algorithms do contribute to differences between zero-shot LLM ranking methods, as do the LLM backbones -- but, even more importantly, the choice of prompt components and wordings affects the ranking. In fact, in our experiments we find that these latter elements at times have more impact on the ranker's effectiveness than the ranking algorithm itself, and that differences among ranking methods become more blurred when prompt variations are considered.
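To make the dimensions studied here concrete, the sketch below contrasts a pointwise and a listwise prompt, each with an optional role-definition component. The template wordings and function names are illustrative assumptions, not the actual prompts used in the study:

```python
# Hypothetical prompt templates illustrating two of the dimensions along which
# zero-shot LLM rankers differ: the ranking algorithm (pointwise vs. listwise)
# and prompt components such as role definition. Wordings are illustrative only.

ROLE = "You are an expert search-quality rater."  # optional role-playing component


def pointwise_prompt(query: str, doc: str, use_role: bool = False) -> str:
    """Build a prompt that judges a single document (pointwise ranking)."""
    parts = [ROLE] if use_role else []
    parts.append(f"Query: {query}")
    parts.append(f"Document: {doc}")
    parts.append("Is this document relevant to the query? Answer Yes or No.")
    return "\n".join(parts)


def listwise_prompt(query: str, docs: list[str], use_role: bool = False) -> str:
    """Build a prompt that orders a list of candidates (listwise ranking)."""
    parts = [ROLE] if use_role else []
    parts.append(f"Query: {query}")
    for i, doc in enumerate(docs, 1):
        parts.append(f"[{i}] {doc}")
    parts.append(
        "Rank the documents above from most to least relevant, "
        "outputting their identifiers, e.g., [2] > [1] > [3]."
    )
    return "\n".join(parts)
```

Toggling `use_role` or rewording the final instruction produces exactly the kind of prompt variations whose effect on ranking effectiveness the study measures, independently of the underlying ranking algorithm.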