Randomly hashed item IDs are used ubiquitously in recommendation models. However, the representations learned from random hashing prevent generalization across similar items, which makes it hard to learn unseen and long-tail items, especially when the item corpus is large, power-law distributed, and dynamically evolving. In this paper, we propose using content-derived features as a replacement for random IDs. We show that simply replacing ID features with content-based embeddings can cause a drop in quality due to reduced memorization capability. To strike a good balance between memorization and generalization, we propose to use Semantic IDs (a compact discrete item representation, learned from frozen content embeddings using RQ-VAE, that captures the hierarchy of concepts in items) as a replacement for random item IDs. As with content embeddings, the compactness of Semantic IDs makes them hard to adapt in recommendation models. We propose novel methods for adapting Semantic IDs in industry-scale ranking models by hashing sub-pieces of the Semantic ID sequences. In particular, we find that the SentencePiece model commonly used in LLM tokenization outperforms manually crafted pieces such as N-grams. To this end, we evaluate our approaches in a real-world ranking model for YouTube recommendations. Our experiments demonstrate that Semantic IDs can replace the direct use of video IDs, improving generalization on new and long-tail item slices without sacrificing overall model quality.
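To make the idea of hashing sub-pieces of a Semantic ID sequence concrete, here is a minimal Python sketch of the manually crafted N-gram variant. The abstract does not specify the hash function, the N-gram length, the table size, or whether pieces are position-tagged, so the names `ngram_pieces`, `hash_piece`, and `semantic_id_to_rows`, along with every parameter value below, are illustrative assumptions rather than the method evaluated in the paper.

```python
import hashlib
from typing import List, Sequence, Tuple

# A piece is (start position, codes); tagging the position is an assumption
# made here so that the same codes at different levels do not collide.
Piece = Tuple[int, Tuple[int, ...]]


def ngram_pieces(semantic_id: Sequence[int], n: int) -> List[Piece]:
    """Split a Semantic ID (a sequence of codebook indices) into contiguous n-grams."""
    return [
        (start, tuple(semantic_id[start:start + n]))
        for start in range(len(semantic_id) - n + 1)
    ]


def hash_piece(piece: Piece, num_buckets: int) -> int:
    """Map a piece to an embedding-table row with a stable hash (SHA-256 is an assumption)."""
    digest = hashlib.sha256(repr(piece).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def semantic_id_to_rows(semantic_id: Sequence[int], n: int, num_buckets: int) -> List[int]:
    """Return one embedding-table row index per n-gram piece."""
    return [hash_piece(p, num_buckets) for p in ngram_pieces(semantic_id, n)]


# Two hypothetical items whose Semantic IDs share the coarse prefix (12, 7).
video_a = [12, 7, 33, 5]
video_b = [12, 7, 90, 41]
rows_a = semantic_id_to_rows(video_a, n=2, num_buckets=1_000_000)
rows_b = semantic_id_to_rows(video_b, n=2, num_buckets=1_000_000)
```

In this sketch the two items map to the same row for their shared leading bigram, so whatever is learned for that row applies to both. That sharing is the kind of generalization across similar items that a random hash of the whole item ID cannot provide. A learned piece vocabulary such as SentencePiece would replace `ngram_pieces` and is not shown here.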