Large-scale commercial search systems optimize for relevance to drive successful sessions that help users find what they are looking for. To maximize relevance, we leverage two complementary objectives: behavioral relevance (results users tend to click or download) and textual relevance (a result's semantic fit to the query). A persistent challenge is the scarcity of expert-provided textual relevance labels relative to abundant behavioral relevance labels. We first address this by systematically evaluating LLM configurations, finding that a specialized, fine-tuned model significantly outperforms a much larger pre-trained one in providing highly relevant labels. Using this optimal model as a force multiplier, we generate millions of textual relevance labels to overcome the data scarcity. We show that augmenting our production ranker with these textual relevance labels leads to a significant outward shift of the Pareto frontier: offline NDCG improves for behavioral relevance while simultaneously increasing for textual relevance. These offline gains were validated by a worldwide A/B test on the App Store ranker, which demonstrated a statistically significant +0.24% increase in conversion rate, with the most substantial performance gains occurring in tail queries, where the new textual relevance labels provide a robust signal in the absence of reliable behavioral relevance labels.
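The offline evaluation above scores one ranking against two label sets. The following minimal sketch shows how NDCG could be computed separately for behavioral and textual labels. The exponential gain `2^g - 1`, the log2 discount, the function names `dcg` and `ndcg`, and the example labels are all illustrative assumptions; the abstract does not specify the NDCG variant, cutoff, or label scales used.

```python
import math

def dcg(gains):
    """Discounted cumulative gain of graded labels listed in ranked order."""
    # Assumed variant: exponential gain with a log2 position discount.
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains, k=None):
    """NDCG@k: DCG of the ranking divided by the DCG of the ideal ordering."""
    k = len(gains) if k is None else k
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0

# Hypothetical labels for one query's top-4 results, in ranked order.
behavioral = [1, 0, 1, 0]  # assumed binary: 1 = clicked or downloaded
textual = [2, 3, 0, 1]     # assumed graded 0-3 semantic fit to the query

print("behavioral NDCG:", ndcg(behavioral))
print("textual NDCG:", ndcg(textual))
```

Under this framing, an outward shift of the Pareto frontier would correspond to a ranker whose average NDCG rises on both label sets at once, rather than trading one for the other.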