Large-scale commercial search systems optimize for relevance to drive successful sessions that help users find what they are looking for. To maximize relevance, we leverage two complementary objectives: behavioral relevance (results users tend to click or download) and textual relevance (a result's semantic fit to the query). A persistent challenge is the scarcity of expert-provided textual relevance labels relative to abundant behavioral relevance labels. We first address this by systematically evaluating LLM configurations, finding that a specialized, fine-tuned model significantly outperforms a much larger pre-trained one in providing highly relevant labels. Using this optimal model as a force multiplier, we generate millions of textual relevance labels to overcome the data scarcity. We show that augmenting our production ranker with these textual relevance labels leads to a significant outward shift of the Pareto frontier: offline NDCG improves for behavioral relevance while simultaneously increasing for textual relevance. These offline gains were validated by a worldwide A/B test on the App Store ranker, which demonstrated a statistically significant +0.24% increase in conversion rate, with the most substantial performance gains occurring in tail queries, where the new textual relevance labels provide a robust signal in the absence of reliable behavioral relevance labels.
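To make the offline metric concrete, the following is a minimal sketch of how one ranked list can be scored with NDCG against two separate label sets, one behavioral and one textual. The gain formulation (exponential gain with a log2 position discount), the function names `dcg` and `ndcg`, and the example labels are illustrative assumptions; the abstract does not specify which NDCG variant or label scale is used.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: exponential gain (2^g - 1), log2 position discount."""
    return sum((2 ** g - 1) / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg(gains_in_ranked_order, k=None):
    """NDCG@k for labels listed in the order the ranker returned the results.

    k=None scores the full list. Returns 0.0 when no result has positive gain.
    """
    gains = list(gains_in_ranked_order)
    ranked = gains[:k]
    ideal = sorted(gains, reverse=True)[:k]  # best achievable ordering of the same labels
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical labels for a single query, in ranked order (made-up values).
behavioral_labels = [1, 0, 1, 0]  # e.g. 1 = result was downloaded
textual_labels = [2, 3, 0, 1]     # e.g. graded semantic fit to the query

# The same ranking is scored once per objective.
behavioral_ndcg = ndcg(behavioral_labels)
textual_ndcg = ndcg(textual_labels)
```

Under this reading, an outward shift of the Pareto frontier means that a ranker can raise one of these two scores without lowering the other, averaged over the evaluation queries.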