Taming the Long Tail: Denoising Collaborative Information for Robust Semantic ID Generation

Item IDs form the backbone of industrial recommender systems, but they suffer from representation instability and poor long-tail generalization in large, dynamic item corpora. Semantic IDs (SIDs) mitigate these issues by quantizing item content features, which lets items with similar content share knowledge. Existing methods try to make SIDs more expressive by combining collaborative information with content features; however, they often overlook a critical distinction: unlike relatively uniform content features, user-item interactions are highly skewed, so the quality of collaborative information differs sharply between popular and long-tail items. This mismatch leads to two limitations: (1) Collaborative noise corrupts behavior-content alignment. Behavior-content alignment is a prevailing approach for modeling shared information, but indiscriminate alignment allows collaborative noise from long-tail items to corrupt their content representations, causing the loss of critical multimodal information. (2) Collaborative noise obscures critical behavioral SIDs. When modeling modality-specific information, prior works typically generate multiple behavioral SIDs per item and weight them equally. This equal-weight scheme fails to reflect the varying importance of different behavioral SIDs, making it difficult for downstream tasks to distinguish informative SIDs from noisy ones. To address these challenges, we propose ADC-SID, a framework that Adaptively Denoises Collaborative information for SID quantization. It comprises two key components: (i) Adaptive Behavior-Content Alignment, which adjusts alignment strength to mitigate corruption caused by collaborative noise; and (ii) a Dynamic Behavioral Weighting Mechanism, which learns importance scores for behavioral SIDs so that downstream models can suppress noise. Extensive experiments demonstrate ADC-SID's superiority...
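
The abstract does not specify how alignment strength is adjusted or how importance scores are learned, so the following is only a minimal illustrative sketch of the two ideas, not the ADC-SID implementation. Every name and formula in it is an assumption made for illustration: `alignment_weight` is a hypothetical gate driven by an item's interaction count, the alignment loss is a plain cosine distance, and `sid_importance` applies a softmax to raw scores that a real model would learn.

```python
import math


def alignment_weight(interaction_count, midpoint=50.0, steepness=1.0):
    """Hypothetical gate in (0, 1) that grows with interaction count.

    Assumption: items with few interactions carry noisier collaborative
    signals, so their behavior-content alignment is down-weighted.
    """
    z = steepness * (math.log1p(interaction_count) - math.log1p(midpoint))
    return 1.0 / (1.0 + math.exp(-z))


def cosine(u, v):
    """Cosine similarity between two non-zero, equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def weighted_alignment_loss(content_vec, behavior_vec, interaction_count):
    """Cosine-distance alignment loss scaled by the popularity gate."""
    weight = alignment_weight(interaction_count)
    return weight * (1.0 - cosine(content_vec, behavior_vec))


def sid_importance(raw_scores):
    """Softmax over per-SID raw scores (placeholders for learned values)."""
    peak = max(raw_scores)
    exps = [math.exp(s - peak) for s in raw_scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    content = [1.0, 0.0]
    behavior = [0.0, 1.0]  # fully misaligned with the content vector
    for count in (2, 50, 5000):
        print(count, weighted_alignment_loss(content, behavior, count))
    print(sid_importance([2.0, 0.5, -1.0]))
```

In this sketch, the same content-behavior mismatch is penalized less for a long-tail item than for a popular one, and a downstream model could multiply each behavioral SID embedding by its importance weight instead of treating all of them equally.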
