Large language models can accurately predict searcher preferences

Relevance labels, which indicate whether a search result is valuable to a searcher, are key to evaluating and optimising search systems. The best way to capture the true preferences of users is to ask them for careful feedback on which results would be useful, but this approach does not scale to produce a large number of labels. Relevance labels at scale are usually obtained from third-party labellers, who judge on behalf of the user, but there is a risk of low-quality data if the labeller doesn't understand user needs. To improve quality, one standard approach is to study real users through interviews, user studies and direct feedback, find areas where labels systematically disagree with users, then educate labellers about user needs through judging guidelines, training and monitoring. This paper introduces an alternative approach for improving label quality. It takes careful feedback from real users, which by definition is the highest-quality first-party gold data that can be derived, and develops a large language model prompt that agrees with that data. We present ideas and observations from deploying language models for large-scale relevance labelling at Bing, and illustrate with data from TREC. We have found that large language models can be effective, with accuracy as good as human labellers and similar capability to pick the hardest queries, best runs, and best groups. Systematic changes to the prompts make a difference in accuracy, but so too do simple paraphrases. Measuring agreement with real searchers requires high-quality "gold" labels, but with these we find that models produce better labels than third-party workers, for a fraction of the cost, and these labels let us train notably better rankers.
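As a rough illustration of what measuring agreement with gold labels can look like, the sketch below compares two sets of binary relevance labels against first-party gold labels using Cohen's kappa, a chance-corrected agreement statistic. The choice of Cohen's kappa is an assumption made for this example and is not stated in the abstract. The function name `cohen_kappa` and all label values are hypothetical toy data invented for the example; they are not results from Bing or TREC.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement if both labelled independently at their own rates.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:
        # Both sequences use one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy data (hypothetical): 1 = relevant, 0 = not relevant.
gold = [1, 1, 0, 0, 1, 0, 1, 0]    # first-party gold labels
model = [1, 1, 0, 0, 1, 0, 0, 0]   # labels from a language model prompt
crowd = [1, 0, 0, 1, 1, 0, 0, 1]   # labels from third-party workers

kappa_model = cohen_kappa(gold, model)  # 0.75 on this toy data
kappa_crowd = cohen_kappa(gold, crowd)  # 0.0 on this toy data
```

On this toy data the model labels match gold on 7 of 8 items and the crowd labels on 4 of 8, so after correcting for chance the model scores 0.75 and the crowd scores 0.0. The same comparison could be repeated for each candidate prompt to see which variant agrees best with the gold labels.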
