Better Generalization with Semantic IDs: A Case Study in Ranking for Recommendations

Anima Singh, Trung Vu, Nikhil Mehta, Raghunandan Keshavan, Maheswaran Sathiamoorthy, Yilin Zheng, Lichan Hong, Lukasz Heldt, Li Wei, Devansh Tandon, Ed H. Chi, Xinyang Yi

Randomly hashed item IDs are used ubiquitously in recommendation models. However, representations learned from random hashing prevent generalization across similar items, which makes it hard to learn unseen and long-tail items, especially when the item corpus is large, power-law distributed, and dynamically evolving. In this paper, we propose using content-derived features as a replacement for random IDs. We show that simply replacing ID features with content-based embeddings can cause a drop in quality due to reduced memorization capability. To strike a good balance between memorization and generalization, we propose using Semantic IDs, a compact discrete item representation learned from frozen content embeddings using RQ-VAE that captures the hierarchy of concepts in items, as a replacement for random item IDs. As with content embeddings, the compactness of Semantic IDs makes them hard to adapt in recommendation models. We propose novel methods for adapting Semantic IDs in industry-scale ranking models by hashing sub-pieces of the Semantic ID sequences. In particular, we find that the SentencePiece model commonly used in LLM tokenization outperforms manually crafted pieces such as N-grams. To this end, we evaluate our approaches in a real-world ranking model for YouTube recommendations. Our experiments demonstrate that Semantic IDs can replace the direct use of video IDs, improving generalization on new and long-tail item slices without sacrificing overall model quality.
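To make the idea of hashing sub-pieces of a Semantic ID concrete, here is a minimal sketch. It assumes toy, hand-written codebooks and shows only the greedy residual quantization step, not RQ-VAE training (there is no encoder, decoder, or learned codebook). The function names, the table size, and the choice to tag each N-gram with its starting level are illustrative assumptions, not details taken from the paper.

```python
import hashlib


def residual_quantize(vec, codebooks):
    """Greedy residual quantization: at each level pick the nearest codeword,
    subtract it, and pass the residual on. Returns one code per level."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        # Nearest codeword by squared Euclidean distance.
        best = min(
            range(len(codebook)),
            key=lambda i: sum((r - c) ** 2 for r, c in zip(residual, codebook[i])),
        )
        codes.append(best)
        residual = [r - c for r, c in zip(residual, codebook[best])]
    return codes


def ngram_pieces(codes, n):
    """Contiguous n-grams of the code sequence, tagged with their start level
    (the tagging is an assumption made for this sketch)."""
    return [(start, tuple(codes[start:start + n])) for start in range(len(codes) - n + 1)]


def hash_to_row(piece, table_size):
    """Stable hash of a piece into an embedding-table row index."""
    digest = hashlib.sha256(repr(piece).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % table_size


# Hypothetical two-level codebooks over 2-D content embeddings.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # level 0: coarse concepts
    [[0.0, 0.0], [0.2, -0.2]],  # level 1: refinements of the residual
]
TABLE_SIZE = 1000  # hypothetical embedding-table size

codes_a = residual_quantize([1.2, 0.8], codebooks)
codes_b = residual_quantize([0.9, 1.0], codebooks)

# Embedding rows looked up for the unigram and bigram pieces of each item.
rows_a = [hash_to_row(p, TABLE_SIZE) for n in (1, 2) for p in ngram_pieces(codes_a, n)]
rows_b = [hash_to_row(p, TABLE_SIZE) for n in (1, 2) for p in ngram_pieces(codes_b, n)]
```

In this toy setup the two nearby embeddings receive the same level-0 code, so their first unigram piece hashes to the same row and that embedding is shared between them, while their finer pieces differ. A learned sub-piece vocabulary such as SentencePiece would replace `ngram_pieces` with pieces chosen from data; that step is not sketched here.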
