Stop Treating Collisions Equally: Qualification-Aware Semantic ID Learning for Recommendation at Industrial Scale

Zheng Hu, Yuxin Chen, Yongsen Pan, Xu Yuan, Yuting Yin, Daoyuan Wang, Boyang Xia, Zefei Luo, Hongyang Wang, Songhao Ni, Dongxu Liang, Jun Wang, Shimin Cai, Tao Zhou, Fuji Ren, Wenwu Ou

Semantic IDs (SIDs) are compact discrete representations derived from multimodal item features, serving as a unified abstraction for ID-based and generative recommendation. However, learning high-quality SIDs remains challenging due to two issues. (1) Collision problem: the quantized token space is prone to collisions, in which semantically distinct items are assigned identical or overly similar SID compositions, resulting in semantic entanglement. (2) Collision-signal heterogeneity: collisions are not uniformly harmful. Some reflect genuine conflicts between semantically unrelated items, while others stem from benign redundancy or systematic data effects. To address these challenges, we propose Qualification-Aware Semantic ID Learning (QuaSID), an end-to-end framework that learns collision-qualified SIDs by selectively repelling qualified conflict pairs and scaling the repulsion strength by collision severity. QuaSID consists of two mechanisms: Hamming-guided Margin Repulsion, which translates low-Hamming SID overlaps into explicit, severity-scaled geometric constraints on the encoder space; and Conflict-Aware Valid Pair Masking, which masks protocol-induced benign overlaps to denoise repulsion supervision. In addition, QuaSID incorporates a dual-tower contrastive objective to inject collaborative signals into tokenization. Experiments on public benchmarks and industrial data validate QuaSID. On public datasets, QuaSID consistently outperforms strong baselines, improving top-K ranking quality by 5.9% over the best baseline while increasing SID composition diversity. In an online A/B test on Kuaishou e-commerce with a 5% traffic split, QuaSID increases ranking GMV-S2 by 2.38% and improves completed orders on cold-start retrieval by up to 6.42%. Finally, we show that the proposed repulsion loss is plug-and-play and enhances a range of SID learning frameworks across datasets.
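The two mechanisms named above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the abstract does not give the loss formula, so the hinge form, the linear severity scaling, the Euclidean distance, and the names `repulsion_loss`, `max_hamming`, `base_margin`, and `group_ids` are all assumptions made for illustration. In particular, `group_ids` is a hypothetical stand-in for whatever rule identifies protocol-induced benign overlaps; here, pairs that share a group are simply excluded from repulsion.

```python
import numpy as np


def hamming_matrix(sids):
    """Pairwise Hamming distance between SIDs; sids has shape (N, L)."""
    return (sids[:, None, :] != sids[None, :, :]).sum(axis=-1)


def repulsion_loss(emb, sids, group_ids, max_hamming=1, base_margin=0.5):
    """Assumed hinge-style repulsion on encoder embeddings.

    emb:       (N, D) float array of encoder outputs.
    sids:      (N, L) int array of quantized token indices.
    group_ids: (N,) int array; pairs in the same group are treated as
               benign overlaps and masked out (hypothetical rule).
    """
    n, num_levels = sids.shape
    ham = hamming_matrix(sids)

    # Valid-pair mask: distinct items that do not share a group.
    not_self = ~np.eye(n, dtype=bool)
    valid = not_self & (group_ids[:, None] != group_ids[None, :])

    # Only low-Hamming (near-colliding) valid pairs are repelled.
    qualified = valid & (ham <= max_hamming)

    # Severity grows with the number of shared tokens (assumed linear),
    # so a full collision gets the largest margin.
    severity = (num_levels - ham) / num_levels
    margin = base_margin * severity

    # Euclidean distance between every pair of embeddings.
    diff = emb[:, None, :] - emb[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1) + 1e-12)

    # Penalize qualified pairs that sit closer than their margin.
    hinge = np.maximum(0.0, margin - dist)
    count = qualified.sum()
    return float((hinge * qualified).sum() / count) if count else 0.0


# Items 0 and 1 share an identical SID and sit close together in encoder space.
emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
sids = np.array([[1, 2, 3], [1, 2, 3], [4, 5, 6]])

loss_conflict = repulsion_loss(emb, sids, group_ids=np.array([0, 1, 2]))
loss_benign = repulsion_loss(emb, sids, group_ids=np.array([0, 0, 2]))
print(loss_conflict, loss_benign)
```

In this toy batch the colliding pair is penalized when the two items belong to different groups and ignored when they share one, which is the qualitative behavior the abstract describes: repel qualified conflicts, leave benign overlaps alone.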
