Facet-Aware Multi-Head Mixture-of-Experts Model with Text-Enhanced Pre-training for Sequential Recommendation

Sequential recommendation (SR) systems excel at capturing users' dynamic preferences by leveraging their interaction histories. Most existing SR systems assign a single embedding vector to each item to represent its features and adopt various models to combine these embeddings into a sequence representation that captures user intent. However, we argue that a single embedding per item is insufficient to capture an item's multi-faceted nature (e.g., movie genres, starring actors). Furthermore, users often exhibit complex and varied preferences within each facet (e.g., liking both action and musical films within the genre facet), which are difficult to represent fully with static identifiers. To address these issues, we propose a novel architecture, the Facet-Aware Multi-Head Mixture-of-Experts Model for Sequential Recommendation (FAME). We use the sub-embedding from each head in the final multi-head attention layer to predict the next item separately, so that each head captures a distinct item facet. A gating mechanism then integrates these per-head predictions by dynamically determining their importance. Additionally, we introduce a Mixture-of-Experts (MoE) network within each attention head to disentangle varied user preferences within each facet, using a learnable router network to aggregate expert outputs based on context. Complementing this architecture, we design a Text-Enhanced Facet-Aware Pre-training module to overcome the limitations of randomly initialized embeddings. Using a pre-trained text encoder and an alternating supervised contrastive learning objective, we explicitly disentangle facet-specific features from textual metadata (e.g., descriptions) before sequential training begins. This ensures that the item embeddings are semantically robust and aligned with the downstream multi-facet framework.
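
To make the prediction path described above concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that each head yields one sub-embedding for the last sequence position, that each expert is a single linear map, that the router and the gate are linear layers followed by a softmax, and that item scores are dot products with per-head item sub-embeddings. All names (`fame_style_scores`, `routers`, `gate_W`, and so on) and all shapes are hypothetical, and the weights are random rather than trained.

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rand_matrix(rng, rows, cols, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def fame_style_scores(head_states, item_sub_embs, experts, routers, gate_W):
    """Score all items from per-head sub-embeddings (illustrative only).

    head_states[h]      : sub-embedding of head h at the last position (length d_sub)
    item_sub_embs[h][i] : sub-embedding of item i as seen by head h (length d_sub)
    experts[h][k]       : d_sub x d_sub matrix of expert k in head h (assumed linear)
    routers[h]          : K x d_sub router matrix of head h (assumed linear)
    gate_W              : H x (H * d_sub) gate matrix over concatenated head states
    """
    num_heads = len(head_states)
    per_head_logits = []
    for h in range(num_heads):
        x = head_states[h]
        # Router: context-dependent weights over the experts of this head.
        route = softmax(matvec(routers[h], x))
        expert_outs = [matvec(E, x) for E in experts[h]]
        # Aggregate expert outputs into one facet-specific query vector.
        facet_query = [
            sum(route[k] * expert_outs[k][j] for k in range(len(route)))
            for j in range(len(x))
        ]
        # Each head predicts the next item separately.
        per_head_logits.append([dot(facet_query, e) for e in item_sub_embs[h]])
    # Gate: importance of each head (facet), computed from all head states.
    concat = [v for x in head_states for v in x]
    gate = softmax(matvec(gate_W, concat))
    num_items = len(item_sub_embs[0])
    scores = [
        sum(gate[h] * per_head_logits[h][i] for h in range(num_heads))
        for i in range(num_items)
    ]
    return scores, gate

# Toy configuration with random, untrained weights (all sizes are arbitrary).
rng = random.Random(0)
H, K, d_sub, n_items = 2, 3, 4, 5
head_states = [[rng.gauss(0.0, 1.0) for _ in range(d_sub)] for _ in range(H)]
item_sub_embs = [rand_matrix(rng, n_items, d_sub) for _ in range(H)]
experts = [[rand_matrix(rng, d_sub, d_sub) for _ in range(K)] for _ in range(H)]
routers = [rand_matrix(rng, K, d_sub) for _ in range(H)]
gate_W = rand_matrix(rng, H, H * d_sub)

scores, gate = fame_style_scores(head_states, item_sub_embs, experts, routers, gate_W)
```

In this sketch the gate weights sum to one across heads, so the final score of an item is a convex combination of the per-facet scores. Whether the actual model combines the heads before or after normalizing the per-head logits is not stated in the abstract, so that choice is an assumption here.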
