Training good representations for items is critical in recommender models. Typically, an item is assigned a unique, randomly generated ID and is represented by an embedding learned for that ID value. Although widely used, this approach has limitations when the number of items is large and items follow a power-law distribution, both typical characteristics of real-world recommendation systems. This leads to the item cold-start problem, where the model is unable to make reliable inferences for tail and previously unseen items. Removing these ID features and their learned embeddings altogether to combat the cold-start issue severely degrades recommendation quality. Content-based item embeddings are more reliable, but they are expensive to store and use, particularly for sequences of users' past item interactions. In this paper, we use Semantic IDs, a compact discrete item representation learned from content embeddings using RQ-VAE that captures the hierarchy of concepts in items. We showcase how we use them as a replacement for item IDs in a resource-constrained ranking model used in an industrial-scale video sharing platform. Moreover, we show how Semantic IDs improve the generalization ability of our system without sacrificing top-level metrics.
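To make the idea of a hierarchical discrete code concrete, the following is a minimal sketch of the residual quantization step that underlies an RQ-VAE. It is not the paper's implementation: the codebooks here are hand-picked toy values rather than learned ones, the encoder and decoder networks are omitted, and the names `residual_quantize` and `TOY_CODEBOOKS` are hypothetical. It only illustrates how a content embedding can be mapped to a short tuple of code indices, with coarse information in the first index and refinements in later ones.

```python
import math


def residual_quantize(embedding, codebooks):
    """Map an embedding to one code index per level (coarse to fine).

    At each level, pick the codeword nearest to the current residual,
    record its index, and subtract it before moving to the next level.
    """
    residual = list(embedding)
    codes = []
    for codebook in codebooks:
        # Nearest codeword by Euclidean distance to the remaining residual.
        index = min(
            range(len(codebook)),
            key=lambda i: math.dist(codebook[i], residual),
        )
        codes.append(index)
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return tuple(codes)


# Assumption: toy 2-D codebooks chosen by hand for illustration only.
# In an RQ-VAE these would be learned jointly with an encoder and decoder.
TOY_CODEBOOKS = [
    [[0.0, 0.0], [10.0, 10.0]],  # level 1: coarse clusters
    [[0.0, 0.0], [1.0, 0.0]],    # level 2: fine adjustments
]

item_a = residual_quantize([11.0, 10.0], TOY_CODEBOOKS)
item_b = residual_quantize([10.0, 10.0], TOY_CODEBOOKS)
item_c = residual_quantize([0.2, 0.1], TOY_CODEBOOKS)
```

In this toy setup, `item_a` and `item_b` have nearby embeddings and therefore share the same first code, while `item_c` falls in a different coarse cluster. Shared prefixes of this kind are what allow tail or unseen items to reuse parameters learned from similar, more popular items.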