The exponential growth of online content has posed significant challenges to ID-based models in industrial recommendation systems, ranging from an extremely high-cardinality and dynamically growing ID space, to highly skewed engagement distributions, to prediction instability caused by natural ID life cycles (e.g., the birth of new IDs and the retirement of old IDs). To address these issues, many systems rely on random hashing to handle the ID space and control the corresponding model parameters (i.e., the embedding table). However, this approach introduces data pollution because multiple IDs share the same embedding, leading to degraded model performance and unstable embedding representations. This paper examines these challenges and introduces Semantic ID prefix ngram, a novel token parameterization technique that significantly improves the performance of the original Semantic ID. Semantic ID prefix ngram creates semantically meaningful collisions by hierarchically clustering items based on their content embeddings, as opposed to assigning them at random. Through extensive experimentation, we demonstrate that Semantic ID prefix ngram not only addresses embedding instability but also significantly improves tail ID modeling, reduces overfitting, and mitigates representation shifts. We further highlight the advantages of Semantic ID prefix ngram in attention-based models that contextualize user histories, showing substantial performance improvements. We also report our experience of integrating Semantic ID into Meta's production Ads Ranking system, which led to notable performance gains and enhanced prediction stability in live deployments.
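The contrast between random hashing and prefix-based tokenization can be illustrated with a rough sketch. The abstract does not specify the exact parameterization, so everything below is an assumption made for illustration: a Semantic ID is taken to be a short tuple of cluster codes (coarse to fine), each in `[0, codebook_size)`, and a "prefix ngram" token is taken to be an index that uniquely identifies each prefix `(c1)`, `(c1, c2)`, and so on. The function names `random_hash_row` and `prefix_ngram_tokens` are hypothetical and do not come from the paper.

```python
import zlib
from typing import List, Sequence


def random_hash_row(raw_id: int, table_size: int) -> int:
    """Baseline: map a raw item ID to an embedding row, ignoring content.

    Unrelated items can land on the same row purely by chance.
    """
    return zlib.crc32(str(raw_id).encode("utf-8")) % table_size


def prefix_ngram_tokens(semantic_id: Sequence[int], codebook_size: int) -> List[int]:
    """Return one token per prefix of a hierarchical code (assumed scheme).

    Tokens for prefixes of different lengths occupy disjoint index ranges,
    so a length-1 prefix never collides with a length-2 prefix.
    """
    tokens: List[int] = []
    value = 0              # the prefix read as a base-`codebook_size` number
    offset = 0             # start of the index range for the current length
    block = codebook_size  # number of distinct prefixes of the current length
    for code in semantic_id:
        if not 0 <= code < codebook_size:
            raise ValueError(f"code {code} outside [0, {codebook_size})")
        value = value * codebook_size + code
        tokens.append(offset + value)
        offset += block
        block *= codebook_size
    return tokens


# Two hypothetical items whose content places them in the same coarse and
# mid-level clusters but in different leaf clusters.
item_a = (2, 1, 3)
item_b = (2, 1, 0)

print(prefix_ngram_tokens(item_a, codebook_size=4))  # [2, 13, 59]
print(prefix_ngram_tokens(item_b, codebook_size=4))  # [2, 13, 56]
```

Under these assumptions, the two items share the tokens for their common prefixes (`2` and `13`) and differ only at the finest level, so any embedding rows they share are shared because the items are similar in content. With `random_hash_row`, by contrast, whether two items collide has nothing to do with their content. In practice the per-prefix embeddings would still need to be combined (for example, by summation), and the token space may itself be hashed to a fixed table size; the abstract leaves both choices open.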