Synthetic Data Powers Product Retrieval for Long-tail Knowledge-Intensive Queries in E-commerce Search

Product retrieval is the backbone of e-commerce search: for each user query, it identifies a high-recall candidate set from billions of items, laying the foundation for high-quality ranking and user experience. Despite extensive optimization for mainstream queries, existing systems still struggle with long-tail queries, especially knowledge-intensive ones. These queries exhibit diverse linguistic patterns, often lack explicit purchase intent, and require domain-specific knowledge reasoning for accurate interpretation. They also suffer from a shortage of reliable behavioral logs, which makes such queries a persistent challenge for retrieval optimization. To address these issues, we propose an efficient data synthesis framework tailored to retrieval involving long-tail, knowledge-intensive queries. The key idea is to implicitly distill the capabilities of a powerful offline query-rewriting model into an efficient online retrieval system. Leveraging the strong language understanding of LLMs, we train a multi-candidate query rewriting model with multiple reward signals and capture its rewriting capability in well-curated query-product pairs through a powerful offline retrieval pipeline. This design mitigates distributional shift in rewritten queries, which might otherwise limit incremental recall or introduce irrelevant products. Experiments demonstrate that without any additional tricks, simply incorporating this synthetic data into retrieval model training leads to significant improvements. Online Side-By-Side (SBS) human evaluation results indicate a notable enhancement in user search experience.
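
The data flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the rewriter, the offline retriever, and the relevance check are hypothetical stand-ins passed in as plain functions, and the toy data is invented for illustration. The sketch also assumes that each synthetic pair keeps the original query (not the rewrite) as its query side, which is one plausible reading of how the rewriting capability is distilled into the online retriever.

```python
from typing import Callable, List, Set, Tuple

# Hypothetical interfaces, assumed for illustration only.
Rewriter = Callable[[str], List[str]]                 # query -> candidate rewrites
Retriever = Callable[[str], List[Tuple[str, float]]]  # rewrite -> (product_id, score)
RelevanceFn = Callable[[str, str], float]             # (original query, product_id) -> score


def synthesize_pairs(
    queries: List[str],
    rewrite: Rewriter,
    retrieve: Retriever,
    relevance: RelevanceFn,
    top_k: int = 2,
    min_relevance: float = 0.5,
) -> List[Tuple[str, str]]:
    """Build (original query, product_id) training pairs from rewrites."""
    pairs: List[Tuple[str, str]] = []
    seen: Set[Tuple[str, str]] = set()
    for query in queries:
        for rewritten in rewrite(query):
            # Retrieve with the rewritten query and keep the top_k hits by score.
            hits = sorted(retrieve(rewritten), key=lambda h: h[1], reverse=True)[:top_k]
            for product_id, _score in hits:
                key = (query, product_id)
                if key in seen:
                    continue
                # Curate: keep a product only if it is judged relevant
                # to the ORIGINAL query, not merely to the rewrite.
                if relevance(query, product_id) >= min_relevance:
                    seen.add(key)
                    pairs.append(key)
    return pairs


# Toy data, invented for this example.
QUERY = "what to wear for a marathon in the rain"
REWRITES = {QUERY: ["waterproof running jacket", "quick-dry running socks"]}
INDEX = {
    "waterproof running jacket": [("jacket_01", 0.9), ("umbrella_07", 0.4)],
    "quick-dry running socks": [("socks_03", 0.8)],
}
RELEVANT = {(QUERY, "jacket_01"), (QUERY, "socks_03")}

pairs = synthesize_pairs(
    [QUERY],
    rewrite=lambda q: REWRITES.get(q, []),
    retrieve=lambda r: INDEX.get(r, []),
    relevance=lambda q, p: 1.0 if (q, p) in RELEVANT else 0.0,
)
print(pairs)
```

In this toy run, the umbrella is retrieved for one rewrite but dropped by the relevance check against the original query, which mirrors the concern that rewritten queries may otherwise introduce irrelevant products. The resulting pairs would then be added to the retrieval model's training data.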
