Retrieving relevant items that match users' queries from a billion-scale corpus forms the core of industrial e-commerce search systems, in which embedding-based retrieval (EBR) methods are prevalent. These methods adopt a two-tower framework to learn embedding vectors for queries and items separately, and can thus leverage efficient approximate nearest neighbor (ANN) search to retrieve relevant items. However, existing EBR methods usually ignore inconsistent user behaviors in industrial multi-stage search systems, resulting in insufficient retrieval efficiency and low commercial return. To tackle this challenge, we propose to improve EBR methods by learning Multi-level Multi-Grained Semantic Embeddings (MMSE). We propose multi-stage information mining to exploit the ordered, clicked, unclicked, and randomly sampled items in practical user behavior data, and then capture query-item similarity via a post-fusion strategy. We then propose multi-grained learning objectives that integrate a retrieval loss with global comparison ability and a ranking loss with local comparison ability to generate semantic embeddings. Experiments on a real-world billion-scale dataset and online A/B tests both verify the effectiveness of MMSE, showing significant improvements on metrics such as offline recall and online conversion rate (CVR).
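The abstract does not give the form of the multi-grained objectives, so the following is only a minimal sketch of how a retrieval loss with global comparison and a ranking loss with local comparison could be combined over two-tower embeddings. Everything in it is an assumption made for illustration, not the MMSE formulation: the in-batch softmax (standing in for randomly sampled negatives), the pairwise hinge loss, the ordering ordered > clicked > unclicked, and the names `retrieval_loss`, `ranking_loss`, `combined_loss`, `temperature`, `margin`, and `alpha`.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def retrieval_loss(q, pos, temperature=0.1):
    """Softmax cross-entropy over all items in the batch (global comparison).

    Assumption: the other queries' positive items act as negatives,
    standing in for randomly sampled items.
    """
    logits = q @ pos.T / temperature                      # (B, B) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                   # matched pairs lie on the diagonal


def ranking_loss(q, better, worse, margin=0.1):
    """Pairwise hinge loss between two items of the same query (local comparison)."""
    s_better = np.sum(q * better, axis=1)
    s_worse = np.sum(q * worse, axis=1)
    return np.mean(np.maximum(0.0, margin - (s_better - s_worse)))


def combined_loss(q, ordered, clicked, unclicked, alpha=1.0):
    """Hypothetical combination: global retrieval term plus local ranking terms
    that encode the assumed preference ordered > clicked > unclicked."""
    local = ranking_loss(q, ordered, clicked) + ranking_loss(q, clicked, unclicked)
    return retrieval_loss(q, ordered) + alpha * local


# Toy usage with random unit-norm embeddings (batch of 4, dimension 8).
rng = np.random.default_rng(0)
q, ordered, clicked, unclicked = (
    l2_normalize(rng.normal(size=(4, 8))) for _ in range(4)
)
loss = combined_loss(q, ordered, clicked, unclicked)
```

In a sketch like this, the retrieval term compares each query against every item in the batch, while the ranking terms only compare items shown for the same query; `alpha` is an assumed weight balancing the two.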