Hi-SAM: A Hierarchical Structure-Aware Multi-modal Framework for Large-Scale Recommendation

Multi-modal recommendation has gained traction as items possess rich attributes like text and images. Semantic ID-based approaches effectively discretize this information into compact tokens. However, two challenges persist: (1) Suboptimal Tokenization: existing methods (e.g., RQ-VAE) lack disentanglement between shared cross-modal semantics and modality-specific details, causing redundancy or collapse; (2) Architecture-Data Mismatch: vanilla Transformers treat semantic IDs as flat streams, ignoring the hierarchy of user interactions, items, and tokens. Expanding items into multiple tokens amplifies length and noise, biasing attention toward local details over holistic semantics. We propose Hi-SAM, a Hierarchical Structure-Aware Multi-modal framework with two designs: (1) Disentangled Semantic Tokenizer (DST): unifies modalities via geometry-aware alignment and quantizes them via a coarse-to-fine strategy. Shared codebooks distill consensus while modality-specific ones recover nuances from residuals, enforced by mutual information minimization; (2) Hierarchical Memory-Anchor Transformer (HMAT): splits positional encoding into inter- and intra-item subspaces via Hierarchical RoPE to restore hierarchy. It inserts Anchor Tokens to condense items into compact memory, retaining details for the current item while accessing history only through compressed summaries. Experiments on real-world datasets show consistent improvements over SOTA baselines, especially in cold-start scenarios. Deployed on a large-scale social platform serving millions of users, Hi-SAM achieved a 6.55% gain in the core online metric.
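
The coarse-to-fine quantization in DST can be illustrated with a minimal sketch. This is a reading of the abstract under stated assumptions, not the authors' implementation: the names `nearest_code` and `tokenize_item` are hypothetical, the codebooks are random rather than learned, plain averaging of the two already-aligned embeddings stands in for the consensus step, and the geometry-aware alignment and mutual information minimization are omitted.

```python
import numpy as np

def nearest_code(x, codebook):
    """Return (index, vector) of the codebook row closest to x in L2 distance."""
    dists = np.linalg.norm(codebook - x, axis=1)
    idx = int(np.argmin(dists))
    return idx, codebook[idx]

def tokenize_item(text_emb, image_emb, shared_cb, text_cb, image_cb):
    """Quantize one item into (shared_id, text_id, image_id)."""
    # Assumption: both modalities already live in one aligned space, and
    # averaging is a stand-in for however the consensus is actually formed.
    consensus = 0.5 * (text_emb + image_emb)
    shared_id, shared_vec = nearest_code(consensus, shared_cb)
    # Fine stage: each modality quantizes the residual the shared code left over.
    text_id, text_vec = nearest_code(text_emb - shared_vec, text_cb)
    image_id, image_vec = nearest_code(image_emb - shared_vec, image_cb)
    recon = {"text": shared_vec + text_vec, "image": shared_vec + image_vec}
    return (shared_id, text_id, image_id), recon

# Toy data: random codebooks and two correlated modality embeddings.
rng = np.random.default_rng(0)
dim, codebook_size = 8, 16
shared_cb = rng.normal(size=(codebook_size, dim))
text_cb = rng.normal(scale=0.3, size=(codebook_size, dim))
image_cb = rng.normal(scale=0.3, size=(codebook_size, dim))
text_emb = rng.normal(size=dim)
image_emb = text_emb + rng.normal(scale=0.3, size=dim)

semantic_ids, recon = tokenize_item(text_emb, image_emb, shared_cb, text_cb, image_cb)
print(semantic_ids)
```

Each item becomes one shared token followed by one token per modality, so the shared code carries what the modalities agree on and the modality-specific codes only have to describe what remains.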

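
The two HMAT mechanisms, split positional encoding and anchor-mediated access to history, can be sketched as follows. This is an illustrative reading of the abstract, not the authors' implementation: the function names are hypothetical, placing one anchor slot at the end of each item is an assumption, and so is the even split of the embedding dimensions between the inter-item and intra-item subspaces.

```python
import numpy as np

def build_layout(num_items, tokens_per_item):
    """Lay out a sequence in which every item is followed by one anchor slot."""
    item_pos, intra_pos, is_anchor = [], [], []
    for item in range(num_items):
        for slot in range(tokens_per_item + 1):  # the last slot is the anchor
            item_pos.append(item)
            intra_pos.append(slot)
            is_anchor.append(slot == tokens_per_item)
    return np.array(item_pos), np.array(intra_pos), np.array(is_anchor)

def anchor_attention_mask(item_pos, is_anchor):
    """mask[q, k] is True when query position q may attend to key position k."""
    n = len(item_pos)
    causal = np.tril(np.ones((n, n), dtype=bool))
    same_item = item_pos[:, None] == item_pos[None, :]
    earlier_item = item_pos[None, :] < item_pos[:, None]
    # History is visible only through the anchors of earlier items.
    history_anchor = earlier_item & is_anchor[None, :]
    return causal & (same_item | history_anchor)

def rope_rotate(x, positions, base=10000.0):
    """Standard rotary embedding for x of shape (n, d) with even d."""
    d = x.shape[1]
    freqs = base ** (-np.arange(0, d, 2) / d)
    angles = positions[:, None] * freqs[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def hierarchical_rope(x, item_pos, intra_pos):
    """Rotate one half of the dims by item index, the other by within-item slot."""
    half = x.shape[1] // 2  # assumption: even split; d must be divisible by 4
    return np.concatenate(
        [rope_rotate(x[:, :half], item_pos), rope_rotate(x[:, half:], intra_pos)],
        axis=1,
    )

# Two items, two semantic-ID tokens each, plus one anchor per item (6 positions).
item_pos, intra_pos, is_anchor = build_layout(num_items=2, tokens_per_item=2)
mask = anchor_attention_mask(item_pos, is_anchor)
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))
rotated = hierarchical_rope(x, item_pos, intra_pos)
print(mask.astype(int))
```

In this toy layout the first token of the second item (position 3) can attend only to the first item's anchor (position 2) and to itself, so earlier items reach it solely through their compressed summaries while tokens inside the current item stay fully visible to one another.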