CAMMSR: Category-Guided Attentive Mixture of Experts for Multimodal Sequential Recommendation

The explosion of multimedia data in information-rich environments has intensified the challenges of personalized content discovery, positioning recommendation systems as an essential form of passive data management. Multimodal sequential recommendation, which leverages diverse item information such as text and images, has shown great promise in enriching item representations and deepening the understanding of user interests. However, most existing models rely on heuristic fusion strategies that fail to capture the dynamic and context-sensitive nature of user-modality interactions. In real-world scenarios, modality preferences vary not only across individuals but also for the same user across different items or categories. Moreover, the synergistic effects between modalities, where combined signals trigger user interest in ways that isolated modalities cannot, remain largely underexplored. To this end, we propose CAMMSR, a Category-guided Attentive Mixture of Experts model for Multimodal Sequential Recommendation. At its core, CAMMSR introduces a category-guided attentive mixture of experts (CAMoE) module, which learns specialized item representations from multiple perspectives and explicitly models inter-modal synergies. This module dynamically allocates modality weights under the guidance of an auxiliary category prediction task, enabling adaptive fusion of multimodal signals. Additionally, we design a modality swap contrastive learning task that enhances cross-modal representation alignment through sequence-level augmentation. Extensive experiments on four public datasets demonstrate that CAMMSR consistently outperforms state-of-the-art baselines, validating its effectiveness in achieving adaptive, synergistic, and user-centric multimodal sequential recommendation.
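
To make the idea of category-guided modality weighting more concrete, the following is a minimal sketch and not the CAMoE module itself, whose architecture the abstract does not specify. It assumes three hypothetical experts (text-only, image-only, and a joint "synergy" expert over the concatenated features), a linear category head, and a gate computed from the predicted category distribution. All names (`category_guided_fusion`, `W_text`, `W_image`, `W_joint`, `W_cat`, `W_gate`), the dimensions, and the random weights are illustrative assumptions.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Small random weights; stands in for learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def category_guided_fusion(text_emb, image_emb, params):
    """Fuse two modality embeddings with category-conditioned expert weights.

    Assumption: experts are plain linear maps and the gate is a linear map
    of the predicted category distribution. The paper's design may differ.
    """
    joint = text_emb + image_emb  # list concatenation
    expert_outputs = [
        matvec(params["W_text"], text_emb),    # text-only expert
        matvec(params["W_image"], image_emb),  # image-only expert
        matvec(params["W_joint"], joint),      # hypothetical synergy expert
    ]
    # Auxiliary head: predict the item's category from the joint features.
    category_probs = softmax(matvec(params["W_cat"], joint))
    # Gate: map the category distribution to one weight per expert.
    gate = softmax(matvec(params["W_gate"], category_probs))
    dim = len(expert_outputs[0])
    fused = [sum(g * out[i] for g, out in zip(gate, expert_outputs))
             for i in range(dim)]
    return fused, gate, category_probs


def category_loss(category_probs, label):
    """Cross-entropy for the auxiliary category prediction task."""
    return -math.log(category_probs[label])


rng = random.Random(0)
d_in, d_out, n_cat, n_experts = 4, 3, 5, 3
params = {
    "W_text": rand_matrix(d_out, d_in, rng),
    "W_image": rand_matrix(d_out, d_in, rng),
    "W_joint": rand_matrix(d_out, 2 * d_in, rng),
    "W_cat": rand_matrix(n_cat, 2 * d_in, rng),
    "W_gate": rand_matrix(n_experts, n_cat, rng),
}
text_emb = [rng.uniform(-1, 1) for _ in range(d_in)]
image_emb = [rng.uniform(-1, 1) for _ in range(d_in)]

fused, gate, category_probs = category_guided_fusion(text_emb, image_emb, params)
aux_loss = category_loss(category_probs, label=2)  # label 2 is arbitrary
```

In this sketch the gate weights sum to one, so the fused representation is a convex combination of the expert outputs. In training, the auxiliary cross-entropy term would be added to the recommendation loss so that the category signal shapes the gate; how the two losses are weighted is not stated in the abstract and is left open here.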
