Industrial recommender systems rely on unique Item Identifiers (ItemIDs). However, this approach struggles with scalability and generalization on large, dynamic datasets with sparse long-tail data. Content-based Semantic IDs (SIDs) address this by sharing knowledge through content quantization, but because they ignore dynamic behavioral properties, purely content-based SIDs have limited expressive power. Existing methods attempt to incorporate behavioral information but overlook a critical distinction: unlike relatively uniform content features, user-item interactions are highly skewed and diverse, creating a vast gap in information quality and quantity between popular and long-tail items. This oversight leads to two critical limitations: (1) Noise Corruption: indiscriminate behavior-content alignment allows collaborative noise from long-tail items to corrupt their content representations, leading to the loss of critical multimodal information. (2) Signal Obscurity: the equal-weighting scheme for SIDs fails to reflect the varying importance of different behavioral signals, making it difficult for downstream tasks to distinguish important SIDs from uninformative ones. To tackle these issues, we propose a mixture-of-quantization framework, MMQ-v2, that adaptively Aligns, Denoises, and Amplifies multimodal information from the content and behavior modalities for semantic ID learning. We name the semantic IDs generated by this framework ADA-SID. The framework introduces two innovations: an adaptive behavior-content alignment that is aware of information richness and shields representations from noise, and a dynamic behavioral router that amplifies critical signals by applying different weights to SIDs. Extensive experiments on public and large-scale industrial datasets demonstrate ADA-SID's significant superiority in both generative and discriminative recommendation tasks.
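The two ideas named in the abstract, richness-aware alignment and per-SID weighting, can be illustrated with a minimal sketch. This is not the MMQ-v2 implementation: the function names, the sigmoid gate on the log interaction count, the `midpoint` and `steepness` parameters, and the softmax router are all assumptions chosen only to make the intuition concrete.

```python
import math


def alignment_weight(interaction_count, midpoint=50.0, steepness=1.0):
    """Hypothetical gate in (0, 1) that grows with an item's interaction count.

    Assumption: information richness is approximated by the raw interaction
    count. Long-tail items get a weight near 0, popular items a weight near 1.
    """
    # The log scale compresses the heavy-tailed distribution of counts.
    x = math.log1p(interaction_count) - math.log1p(midpoint)
    return 1.0 / (1.0 + math.exp(-steepness * x))


def gated_alignment_loss(content_vec, behavior_vec, interaction_count):
    """Squared distance between the two embeddings, scaled by the gate.

    A small gate means the noisy behavior embedding of a long-tail item
    barely pulls on its content embedding.
    """
    sq_dist = sum((c - b) ** 2 for c, b in zip(content_vec, behavior_vec))
    return alignment_weight(interaction_count) * sq_dist


def router_weights(scores):
    """Softmax over per-SID relevance scores (assumed to be given).

    Returns non-negative weights that sum to 1, replacing equal weighting.
    """
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

With the same pair of embeddings, `gated_alignment_loss` is smaller for an item with 2 interactions than for one with 10,000, so sparse behavioral evidence contributes less to the alignment objective. Likewise, `router_weights([2.0, 0.0, 0.0])` assigns the first SID a larger weight than the other two instead of treating all three equally.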