Despite rapid progress in continuous embeddings for e-commerce search relevance, capturing fine-grained attribute distinctions remains a long-standing open problem. Discrete Semantic Identifiers (SIDs) have been widely adopted as a promising alternative, but existing SID generation methods rely heavily on unsupervised quantization. In realistic scenarios, the lack of explicit supervision makes it difficult to control which items share an SID, which limits the usefulness of SIDs for query-dependent ranking. To address this limitation, we explicitly model discrete relevance features and develop a Discrete Semantic Identifier Relevance Model (DSIRM). On the item side, we present a query-bridged contrastive quantization approach that injects query-item interaction supervision into Residual Quantization to learn relevance-aware semantic partitions. On the query side, we use generative LLMs to predict item SIDs directly from query text, addressing tail queries and intent ambiguity. Hierarchical prefix matching between query and item SIDs yields discriminative features that complement dense signals. Extensive experiments on Tmall's production data show that the proposed approach improves offline AUC by +1.54\%. Deployed via an efficient hybrid architecture, it achieves significant online lifts (+0.13\% UCTR, +0.25\% UCTCVR), demonstrating its industrial value.
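To make the hierarchical prefix matching step concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the DSIRM implementation: the abstract does not specify the feature encoding, so the helper names `prefix_match_depth` and `prefix_match_features`, the three-level SIDs, the per-level binary features, and the idea that the query-side LLM may return several candidate SIDs are all hypothetical choices.

```python
from typing import List, Sequence


def prefix_match_depth(query_sid: Sequence[int], item_sid: Sequence[int]) -> int:
    """Count the leading quantization levels on which two SIDs agree."""
    depth = 0
    for q_code, i_code in zip(query_sid, item_sid):
        if q_code != i_code:
            break
        depth += 1
    return depth


def prefix_match_features(
    query_sids: Sequence[Sequence[int]],
    item_sid: Sequence[int],
    num_levels: int,
) -> List[int]:
    """Per-level binary features (an assumed encoding).

    features[l] is 1 when at least one predicted query SID shares the
    first l + 1 codes with the item SID, and 0 otherwise.
    """
    best = max((prefix_match_depth(q, item_sid) for q in query_sids), default=0)
    return [1 if best > level else 0 for level in range(num_levels)]


if __name__ == "__main__":
    # Hypothetical SIDs: two candidates predicted for a query, one item SID.
    predicted_query_sids = [(12, 7, 3), (12, 9, 1)]
    item_sid = (12, 7, 5)
    print(prefix_match_features(predicted_query_sids, item_sid, num_levels=3))
    # prints [1, 1, 0]
```

In this sketch a deeper shared prefix means the query and the item fall in the same finer-grained partition, so the per-level indicators can be fed to a ranking model alongside dense similarity scores.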