Generative recommendation systems have achieved significant advances by leveraging semantic IDs to represent items. However, existing approaches that tokenize each modality independently face two critical limitations: (1) redundancy across modalities, which reduces efficiency, and (2) failure to capture inter-modal interactions, which limits the quality of item representations. We introduce FusID, a modality-fused semantic ID framework that addresses these limitations through three key components: (i) multimodal fusion, which learns unified representations by jointly encoding information across modalities; (ii) representation learning, which pulls the embeddings of frequently co-occurring items closer while maintaining distinctiveness and preventing feature redundancy; and (iii) product quantization, which converts the fused continuous embeddings into multiple discrete tokens to mitigate ID conflicts. Evaluated on a multimodal next-song recommendation (i.e., playlist continuation) benchmark, FusID achieves zero ID conflicts (each token sequence maps to exactly one song), mitigates codebook underutilization, and outperforms baselines on MRR and Recall@k (k = 1, 5, 10, 20).
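To make the quantization step concrete, the sketch below illustrates plain product quantization in the sense the abstract describes: a fused continuous embedding is split into sub-vectors, and each sub-vector is assigned to its nearest codeword in a per-subspace codebook, yielding multiple discrete tokens per item. All dimensions, codebook sizes, and the random codebooks are illustrative assumptions, not FusID's actual configuration or training procedure.

```python
import numpy as np

# Illustrative setup (assumed values, not from the paper):
# D-dim embedding, M sub-spaces, K codewords per sub-space.
D, M, K = 8, 4, 16
rng = np.random.default_rng(0)

# One codebook of K codewords per sub-space; in practice these would be
# learned (e.g., via k-means on training embeddings), not random.
codebooks = rng.normal(size=(M, K, D // M))

def pq_tokens(embedding: np.ndarray) -> list[int]:
    """Map a D-dim continuous embedding to M discrete token IDs."""
    subvecs = embedding.reshape(M, D // M)
    tokens = []
    for m in range(M):
        # Nearest codeword (Euclidean distance) in sub-space m.
        dists = np.linalg.norm(codebooks[m] - subvecs[m], axis=1)
        tokens.append(int(np.argmin(dists)))
    return tokens

song_embedding = rng.normal(size=D)
tokens = pq_tokens(song_embedding)  # M token IDs, each in [0, K)
```

Because an item is identified by the *sequence* of M tokens rather than a single code, the effective ID space is K^M, which is what lets such schemes drive ID conflicts toward zero while keeping each codebook small.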